The document proposes a new parallel sorting algorithm called AA-sort that is suitable for multi-core SIMD processors. AA-sort avoids unaligned memory accesses to fully utilize SIMD instructions. It sorts data in-core using an optimized combsort and then merges sorted blocks out-of-core in parallel. Experimental results on PowerPC and Cell processors show AA-sort has better scalability and performance than other algorithms both sequentially and in parallel.
1. AA-Sort: A New Parallel Sorting
Algorithm for Multi-Core SIMD
Processors
By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani
Presented By: M. Edirisinghe, H. Nawarathna
3. Introduction
• High-performance processors provide multiple
hardware threads within one physical
processor with multiple cores and
simultaneous multithreading
• Many processors provide Single Instruction
Multiple Data (SIMD) instructions
5. SIMD Instructions
• Advantages:
– Data parallelism
– Reduce the number of conditional branches in
programs (can use vector compare and vector
select instead)
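A minimal sketch of the compare-and-select idea, emulating 4-lane vectors with plain Python lists. The helper names `vec_cmpgt`, `vec_sel` and `vec_cmpswap` are hypothetical and only mimic the behaviour of vector compare and vector select; they are not real VMX intrinsics.

```python
# Emulates 4-lane SIMD vectors of unsigned 32-bit values with Python lists.
# Illustration only: real hardware does each step in one instruction.

MASK32 = 0xFFFFFFFF

def vec_cmpgt(a, b):
    """Lane-wise compare: all-ones mask where a > b, zero elsewhere."""
    return [MASK32 if x > y else 0 for x, y in zip(a, b)]

def vec_sel(a, b, mask):
    """Lane-wise select: take b where the mask is set, a elsewhere."""
    return [(x & ~m & MASK32) | (y & m) for x, y, m in zip(a, b, mask)]

def vec_cmpswap(a, b):
    """Return (lane-wise min, lane-wise max) using compare + select only."""
    mask = vec_cmpgt(a, b)
    lo = vec_sel(a, b, mask)  # b where a > b, otherwise a
    hi = vec_sel(b, a, mask)  # a where a > b, otherwise b
    return lo, hi

lo, hi = vec_cmpswap([5, 1, 7, 3], [2, 8, 7, 0])
print(lo, hi)
```

The swap decision is carried by the mask, so the per-element `if` of a scalar compare-and-swap disappears from the sorting loop.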
6. SIMD Instruction Set
• The paper uses Vector Multimedia eXtension (VMX or
AltiVec) instructions
• VMX provides a set of 128 bit vector registers
– Each register holds four 32 bit values
• Useful VMX instructions for sorting:
– Vector Compare
– Vector Select
– Vector Permute
7. Sorting Algorithms and SIMD
• Many sorting algorithms require unaligned or
element-wise memory accesses (e.g. quicksort)
• These accesses incur additional overhead and attenuate the
benefits of SIMD instructions
8. Paper’s Contribution
• Proposes Aligned-Access sort (AA-sort), a new
parallel sorting algorithm suitable for
exploiting both SIMD instructions and thread-level
parallelism available on today’s multi-core
processors, with computational
complexity of O(N log(N))
9. AA-Sort Algorithm
• Assumptions:
– First element of the array to be sorted is
aligned on a 128 bit boundary
– Number of elements in the array, N, is a
multiple of four
10. AA-Sort Algorithm
• Array of integer values a[N] is equivalent to an
array of vector integers va[N/4]
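A small sketch of this equivalence, assuming four 32 bit lanes per vector. The helper `as_vectors` is hypothetical; on real hardware the same memory is simply reinterpreted, with no copy.

```python
def as_vectors(a, lanes=4):
    """Group a flat list a[N] into N/lanes vectors (copying, for illustration)."""
    if len(a) % lanes != 0:
        raise ValueError("N must be a multiple of the lane count")
    return [a[i:i + lanes] for i in range(0, len(a), lanes)]

a = list(range(8))
va = as_vectors(a)

# Element a[i] lives in vector i // 4, lane i % 4.
assert all(a[i] == va[i // 4][i % 4] for i in range(len(a)))
print(va)
```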
11. AA-Sort Algorithm
• Consists of 2 algorithms:
1. In-core sorting algorithm
2. Out-of-core sorting algorithm
• Phases of execution:
–Divide all of the data into blocks that fit into the
cache of the processor
–Sort each block with the in-core sorting algorithm
–Merge the sorted blocks with the out-of-core
sorting algorithm
12. Combsort
• Extension to bubble sort that eliminates “turtles”
(small values near the end of the array)
• Compares and swaps non-adjacent elements
• Improves performance over bubble sort
• Computational complexity O(N log(N)) on average
• Problems with SIMD instructions:
– Unaligned memory access
– Loop-carried dependencies
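For reference, a scalar combsort can be sketched as follows. This is the textbook form, not the SIMD variant from the paper; the default shrink factor of 1.28 is the value this deck reports for the implementation.

```python
def combsort(a, shrink=1.28):
    """Sort list a in place with scalar combsort and return it."""
    n = len(a)
    gap = n
    swapped = True
    # Keep going until the gap has shrunk to 1 and a full pass makes no swap.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(n - gap):
            # Compare elements that are `gap` apart, not just neighbours.
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a

print(combsort([5, 2, 9, 1, 5, 6, 0, 3]))
```

The inner loop reads `a[i]` and `a[i + gap]` at arbitrary offsets, which is where the unaligned accesses come from when the loop is vectorized naively.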
14. In-Core Algorithm
• Execution steps:
1. Sort values within each vector in ascending
order
2. Execute combsort to sort the values into the
transposed order
3. Reorder the values from the transposed order
into the original order
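The sketch below illustrates only the data layout, not the SIMD combsort itself. It assumes that “transposed order” means the sorted sequence runs down lane 0 of every vector, then lane 1, and so on; the helper names are hypothetical.

```python
def to_transposed(sorted_vals, lanes=4):
    """Lay out sorted values so that vector i, lane j holds rank j * n + i."""
    n = len(sorted_vals) // lanes  # number of vectors
    return [[sorted_vals[j * n + i] for j in range(lanes)] for i in range(n)]

def from_transposed(va):
    """Read lane by lane across all vectors to recover the original order."""
    lanes = len(va[0]) if va else 0
    return [va[i][j] for j in range(lanes) for i in range(len(va))]

va = to_transposed(list(range(8)))
print(va)
print(from_transposed(va))
```

Under this assumption, comparing whole vectors `va[i]` and `va[i + gap]` lane by lane always touches aligned data, which is what lets the combsort step avoid unaligned accesses.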
17. In-Core Algorithm
• All 3 steps can be executed using SIMD
instructions without unaligned memory access
• Computational complexity dominated by step
2
– Average O(N log N)
– Worst case O(N^2)
• Poor memory access locality
– Performance degrades if the data cannot fit into
the cache of the processor
18. Out-of-Core Algorithm
• Built on a vector merge operation that merges two sorted vectors
– a = [a3:a2:a1:a0], b = [b3:b2:b1:b0] are each sorted
– c = [b:a] = vector_merge(a, b) holds all eight values in sorted order
Figure: [b:a] = vector_merge(a,b) takes sorted a (a0 to a3) and sorted b (b0 to b3) and yields c0 to c7 in sorted order.
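One way to merge two sorted 4-element vectors with a fixed sequence of compare-and-swap steps is a bitonic merge network, sketched below in scalar Python. This is a stand-in chosen for illustration; the network used in the paper may differ, and the function name `vector_merge` is borrowed from the slide only as a label.

```python
def cmpswap(v, i, j):
    """Put the smaller of v[i], v[j] at i and the larger at j."""
    if v[i] > v[j]:
        v[i], v[j] = v[j], v[i]

def vector_merge(a, b):
    """Merge sorted 4-element lists a and b; return (lower four, upper four)."""
    # a ascending followed by b descending forms a bitonic sequence.
    v = list(a) + list(reversed(b))
    # Three stages with fixed partners; on SIMD hardware each stage would be
    # a vector min/max (or compare + select) plus a permute.
    for dist in (4, 2, 1):
        for i in range(8):
            if i & dist == 0:
                cmpswap(v, i, i + dist)
    return v[:4], v[4:]

print(vector_merge([1, 4, 6, 9], [2, 3, 7, 8]))
```

The comparison pattern does not depend on the data, so the same instruction sequence runs for every pair of input vectors.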
21. Out-of-Core Algorithm
• No unaligned memory accesses
• Better memory access locality compared with
in-core sorting algorithm
– Higher performance when data cannot fit in the
cache
22. Overall AA Sort Scheme
• Divide all of the data to be sorted into blocks
that fit in the cache or the local memory of
the processor
• Sort each block with the in-core sorting
algorithm in parallel using multiple threads,
where each thread processes an independent
block.
• Merge the sorted blocks with the out-of-core
sorting algorithm using multiple threads
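The three phases above can be outlined as follows. This is a structural sketch only: Python's built-in `sorted` stands in for the in-core algorithm, `heapq.merge` stands in for the out-of-core merge, and the function name `aa_sort_outline` is hypothetical. Python threads will not show the speedup of real hardware threads.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge

def aa_sort_outline(data, block_size, threads=4):
    """Divide into blocks, sort blocks in parallel, then merge pairwise."""
    # Phase 1: divide the data into blocks of at most block_size elements.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Phase 2: each task sorts one independent block.
        blocks = list(pool.map(sorted, blocks))
        # Phase 3: merge neighbouring blocks until one block remains.
        while len(blocks) > 1:
            pairs = [(blocks[i], blocks[i + 1])
                     for i in range(0, len(blocks) - 1, 2)]
            merged = list(pool.map(lambda p: list(merge(*p)), pairs))
            if len(blocks) % 2:
                merged.append(blocks[-1])  # odd block carries over unchanged
            blocks = merged
    return blocks[0] if blocks else []

print(aa_sort_outline([9, 3, 7, 1, 8, 2, 6, 0, 5, 4], block_size=3))
```

In this outline each merge stage has half as many independent tasks as the previous one, so the last stages cannot keep all threads busy.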
23. Overall AA Sort Scheme Contd.
• Notation:
– Number of elements of data = N
– Number of elements per block = B
– Number of blocks = N/B
• In-core sorting phase:
– Computational time for the in-core sorting of each block is proportional
to B log(B)
– With N/B blocks the total is proportional to N log(B); B is a constant
fixed by the cache size, so the complexity of in-core sorting = O(N)
• Out-of-core sorting phase:
– Merging sorted blocks in out-of-core sorting involves log(N/B) stages
– Computational complexity of each stage = O(N)
– Complexity of out-of-core sorting = O(N log(N))
• Hence, computational complexity of entire AA-sort = O(N log(N))
24. Overall AA Sort Scheme Contd.
An example of the entire AA-sort process,
where number of blocks (N/B) = 8 and the number of threads = 4
25. Experimental Setup
• PowerPC 970MP System
– Two 2.5 GHz dual-core processors
– 8GB system memory
– Each core had 1MB L2 cache memory
– Linux kernel 2.6.20
• System with Cell BE processors
– Two 2.4 GHz processors
– 1GB system memory
– Only SPE cores were used (16 SPE cores with
256KB local memory each)
– Linux kernel 2.6.15
26. Implementation
• Block size is half the size of the L2 cache (or of
the local memory on the SPE)
– 512KB (128K of 32 bit values) on PowerPC 970MP
– 128KB (32K of 32 bit values) on the SPE
• Shrink factor for combsort: 1.28
• Multiway merge technique with out-of-core
sorting
– 4 way merge
– Number of merging stages reduced from log2(N/B)
to log4(N/B)
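The arithmetic behind these figures can be checked with a short sketch. The helper `merge_stages` is hypothetical, and “32 million integers” is assumed here to mean 32 × 2^20 values.

```python
import math

def merge_stages(n_elements, block_elements, ways):
    """Count merge stages needed to reduce all blocks to one, merging `ways` at a time."""
    blocks = math.ceil(n_elements / block_elements)
    stages = 0
    while blocks > 1:
        blocks = math.ceil(blocks / ways)
        stages += 1
    return stages

block_elems = 512 * 1024 // 4   # 512KB of 4-byte values = 128K values
n = 32 * 1024 * 1024            # assumed size of the 32-million-integer input

print(block_elems)                        # values per block on PowerPC 970MP
print(merge_stages(n, block_elems, 2))    # stages with 2 way merge
print(merge_stages(n, block_elems, 4))    # stages with 4 way merge
```

Under these assumptions there are 256 blocks, so a 2 way merge needs 8 stages and a 4 way merge needs 4, halving the number of passes over the data.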
27. Effects of Using SIMD Instructions
Branch misprediction rate.
Acceleration by SIMD instructions for sorting 16 K random integers on one
core of PowerPC 970MP.
28. Performance for 32 bit Integers
Performance of sequential version of each algorithm on a PowerPC
970MP core for sorting random 32-bit integers with various data sizes.
29. Performance for 32 bit Integers Contd.
Performance comparison on one PowerPC 970MP core for various input
datasets with 32 million integers.
30. Performance for 32 bit Integers Contd.
The execution time of parallel versions of AA-sort and GPUTeraSort on
up to 4 cores of PowerPC 970MP.
31. Performance for 32 bit Integers Contd.
Scalability with increasing number of cores on Cell BE for 32 million
integers
32. Conclusions
• The paper describes a new parallel sorting algorithm
called Aligned-Access sort (AA-sort)
• The algorithm does not involve any unaligned
memory accesses
• Evaluated on PowerPC 970MP and Cell
Broadband Engine Processors
• Demonstrated better scalability and
performance in both sequential and parallel
versions
33. Conclusions Contd.
• Evaluation was performed only on 32 bit integers
• Performance comparison was performed on a
limited number of architectures
– Jatin Chhugani et al., “Efficient Implementation of Sorting on Multi-Core
SIMD CPU Architecture”, Applications Research Lab, Corporate Technology
Group, Intel Corporation, August 2008, Auckland, New Zealand
• Does not discuss how multiple threads cooperate
on one merge operation when the number of blocks
becomes smaller than the number of threads