AA-Sort: A New Parallel Sorting
Algorithm for Multi-Core SIMD
Processors
By: H. Inoue, T. Moriyama, H. Komatsu, T. Nakatani
Presented By: M. Edirisinghe, H. Nawarathna
Content

• Introduction
• SIMD instruction set
• AA-sort algorithm
• In-core algorithm
• Out-of-core algorithm
• Sorting scheme in AA-sort
• Experimental results
Introduction
• High-performance processors provide multiple
hardware threads within one physical
processor with multiple cores and
simultaneous multithreading
• Many processors provide Single Instruction
Multiple Data (SIMD) instructions

SIMD Instructions
• Advantages:
– Data parallelism
– Reduce the number of conditional branches in
programs (can use vector compare and vector
select instead)
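
The compare-and-select idea can be sketched as follows. This is a plain-Python emulation of four 32-bit lanes, not real VMX code; the helper names `vec_cmpgt`, `vec_sel`, `vec_min` and `vec_max` are chosen here for illustration, and values are assumed to be unsigned 32-bit integers. The conditional inside `vec_cmpgt` only stands in for a hardware compare, which produces the mask without a branch.

```python
LANE_MASK = 0xFFFFFFFF  # one 32-bit lane


def vec_cmpgt(a, b):
    """Lane-wise a > b: all-ones mask where true, all-zeros where false."""
    return [LANE_MASK if x > y else 0 for x, y in zip(a, b)]


def vec_sel(a, b, mask):
    """Bitwise select: bits of b where the mask bit is 1, bits of a elsewhere."""
    return [((x & ~m) | (y & m)) & LANE_MASK for x, y, m in zip(a, b, mask)]


def vec_min(a, b):
    # Where a > b, take b; otherwise keep a. No per-element branch is needed.
    return vec_sel(a, b, vec_cmpgt(a, b))


def vec_max(a, b):
    # Where a > b, take a; otherwise keep b.
    return vec_sel(b, a, vec_cmpgt(a, b))


a = [3, 1, 4, 1]
b = [2, 7, 1, 8]
print(vec_min(a, b))  # [2, 1, 1, 1]
print(vec_max(a, b))  # [3, 7, 4, 8]
```

A lane-wise min/max pair like this is the compare-and-swap building block that the sorting steps later in the deck rely on.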

SIMD Instruction Set
• Uses Vector Multimedia eXtension (VMX, also
known as AltiVec) instructions
• Provides a set of 128-bit vector registers
– Each register holds four 32-bit values

• Useful VMX instructions for sorting:
– Vector Compare
– Vector Select
– Vector Permute
Sorting Algorithms and SIMD
• Many sorting algorithms require unaligned or
element-wise memory access (e.g., quicksort)
• Such accesses incur additional overhead and
attenuate the benefits of SIMD instructions

Paper’s Contribution
• Proposes Aligned-Access sort (AA-sort), a new
parallel sorting algorithm suited to exploiting
both SIMD instructions and the thread-level
parallelism available on today's multi-core
processors, with a computational complexity of
O(N log(N))

AA-Sort Algorithm
• Assumptions:
– First element of the array to be sorted is
aligned on a 128 bit boundary
– Number of elements in the array, N, is a
multiple of four

AA-Sort Algorithm
• Array of integer values a[N] is equivalent to an
array of vector integers va[N/4]

AA-Sort Algorithm
• Consists of two algorithms:
1. In-core sorting algorithm
2. Out-of-core sorting algorithm

• Phases of execution:
– Divide all of the data into blocks that fit into the
cache of the processor
– Sort each block with the in-core sorting algorithm
– Merge the sorted blocks with the out-of-core
sorting algorithm
Combsort
• Extension of bubble sort that eliminates "turtles"
(small values near the end of the array)
• Compares and swaps non-adjacent elements
• Improves performance over bubble sort
• Computational complexity O(N log(N)) on average
• Problems with SIMD instructions:
– Unaligned memory access
– Loop-carried dependencies
Combsort
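
For reference, a minimal scalar combsort might look like the sketch below. It is the textbook form of the algorithm rather than code from the paper; the function name `combsort` is chosen here, and the default shrink factor of 1.28 is the value reported on the Implementation slide.

```python
def combsort(a, shrink=1.28):
    """Sort list a in place by comparing elements a shrinking gap apart."""
    gap = len(a)
    swapped = True
    while gap > 1 or swapped:
        # Shrink the gap each pass; once it reaches 1 this is a bubble pass.
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(a) - gap):
            # a[i + gap] is read at an arbitrary offset: when gap is not a
            # multiple of the vector width this is the unaligned access
            # that makes a direct SIMD version awkward.
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a


print(combsort([5, 1, 4, 2, 8, 0, 3]))  # [0, 1, 2, 3, 4, 5, 8]
```

Large gaps in the early passes move small values near the end of the array toward the front quickly, which is why combsort needs far fewer passes than bubble sort.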
In-Core Algorithm
• Execution steps:
1. Sort values within each vector in ascending
order
2. Execute combsort to sort the values into the
transposed order

In-Core Algorithm
• Use extended Combsort

In-Core Algorithm
3. Reorder the values from the transposed order
into the original order
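
The three steps can be sketched in plain Python, with each 128-bit vector emulated as a list of four values. This is an illustrative reading of the slides, not the authors' code: the names `cmpswap`, `cmpswap_skew` and `in_core_sort` are assumptions, and "transposed order" is taken to mean that the sorted sequence runs down lane 0 of every vector, then lane 1, and so on. The skewed compare handles pairs that wrap around from the last vectors to the first ones, one lane over.

```python
W = 4  # lanes per vector: four 32-bit values in a 128-bit register


def cmpswap(a, b):
    """Lane-wise compare-and-swap: minima stay in a, maxima go to b."""
    lo = [min(x, y) for x, y in zip(a, b)]
    hi = [max(x, y) for x, y in zip(a, b)]
    changed = lo != a
    a[:], b[:] = lo, hi
    return changed


def cmpswap_skew(a, b):
    """Compare lane j of a with lane j + 1 of b (wrap-around pairs)."""
    changed = False
    for j in range(W - 1):
        if a[j] > b[j + 1]:
            a[j], b[j + 1] = b[j + 1], a[j]
            changed = True
    return changed


def in_core_sort(values, shrink=1.28):
    assert len(values) % W == 0, "N is assumed to be a multiple of four"
    n = len(values) // W
    if n == 0:
        return []
    # Step 1: sort the values within each vector.
    va = [sorted(values[i * W:(i + 1) * W]) for i in range(n)]
    # Step 2: combsort over whole vectors, producing the transposed order.
    gap = int(n / shrink)
    while gap > 1:
        for i in range(n - gap):
            cmpswap(va[i], va[i + gap])
        for i in range(n - gap, n):
            cmpswap_skew(va[i], va[i + gap - n])
        gap = int(gap / shrink)
    swapped = True
    while swapped:  # gap-1 passes until nothing moves
        swapped = False
        for i in range(n - 1):
            swapped |= cmpswap(va[i], va[i + 1])
        swapped |= cmpswap_skew(va[n - 1], va[0])
    # Step 3: read lane by lane to go from transposed to original order.
    return [va[i][j] for j in range(W) for i in range(n)]


print(in_core_sort([8, 7, 6, 5, 4, 3, 2, 1]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every access in step 2 touches a whole vector at a vector index, so no unaligned loads are needed. The final loop repeats gap-1 passes until no swap occurs, which guarantees a sorted result.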

In-Core Algorithm
• All 3 steps can be executed using SIMD
instructions without unaligned memory access
• Computational complexity dominated by step 2
– Average O(N log(N))
– Worst case O(N^2)

• Poor memory access locality
– Performance degrades if the data cannot fit into
the cache of the processor
Out-of-core Algorithm
• Used to merge two sorted vectors
– a = [a3:a2:a1:a0], b = [b3:b2:b1:b0] are sorted
– c = [b:a] = merge and sort (a, b)

Figure: sorted vectors a = (a0, a1, a2, a3) and b = (b0, b1, b2, b3) are combined by [b:a] = vector_merge(a,b) into the sorted sequence c0 ... c7.
Dataflow of Merge
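Figure: dataflow network of comparators (<) taking the sorted inputs a0 ... a3 and b0 ... b3, with intermediate values min00, max00, min11, max11, min22, max22, min33 and max33.

• Requires lg(P) + 1 stages, where P is the number
of elements in a vector
• Here P = 4, so lg(P) + 1 = 3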
Merge Operation
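
A sketch of how the merge could work, in plain Python. Two assumptions are made here: the three-stage comparator network is taken to be Batcher's odd-even merge for eight inputs (which has three stages for P = 4), and the array-level loop follows the common pattern of always loading the next vector from whichever input has the smaller head element. The names `vector_merge` and `merge_sorted_arrays` are illustrative.

```python
# Comparator pairs of an odd-even merge network for two sorted 4-vectors,
# indexed over the concatenation [a0..a3, b0..b3]; one list per stage.
STAGES = [
    [(0, 4), (1, 5), (2, 6), (3, 7)],
    [(2, 4), (3, 5)],
    [(1, 2), (3, 4), (5, 6)],
]


def vector_merge(a, b):
    """Merge two sorted 4-vectors; return (lower four, upper four)."""
    x = list(a) + list(b)
    for stage in STAGES:
        for i, j in stage:  # pairs within a stage are independent
            if x[i] > x[j]:
                x[i], x[j] = x[j], x[i]
    return x[:4], x[4:]


def merge_sorted_arrays(a, b):
    """Merge sorted lists whose lengths are non-zero multiples of four."""
    va = [a[i:i + 4] for i in range(0, len(a), 4)]
    vb = [b[i:i + 4] for i in range(0, len(b), 4)]
    ia = ib = 1
    vmin, vmax = va[0], vb[0]
    out = []
    while True:
        vmin, vmax = vector_merge(vmin, vmax)
        out.extend(vmin)  # the lower four values are final
        # Only whole, aligned vectors are loaded: pick the input whose
        # next vector starts with the smaller value.
        if ia < len(va) and (ib == len(vb) or va[ia][0] < vb[ib][0]):
            vmin = va[ia]
            ia += 1
        elif ib < len(vb):
            vmin = vb[ib]
            ib += 1
        else:
            out.extend(vmax)  # both inputs exhausted
            return out


print(vector_merge([1, 5, 6, 7], [2, 3, 4, 8]))
# ([1, 2, 3, 4], [5, 6, 7, 8])
```

The loop reads one scalar head from each input only to decide which vector to load next; the data itself moves in whole vectors.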

Out-of-core Algorithm
• No unaligned memory accesses
• Better memory access locality compared with
the in-core sorting algorithm
– Higher performance when data cannot fit in the
cache
Overall AA Sort Scheme
• Divide all of the data to be sorted into blocks
that fit in the cache or the local memory of
the processor
• Sort each block with the in-core sorting
algorithm in parallel using multiple threads,
where each thread processes an independent
block.
• Merge the sorted blocks with the out-of-core
sorting algorithm using multiple threads
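
The overall flow can be outlined as below. This only shows the structure of the scheme: Python's built-in `sorted` stands in for the in-core sort, `heapq.merge` stands in for the out-of-core merge, and the function name `aa_sort_scheme` and its parameters are assumptions. CPython threads will not give a real speedup for this workload, so the thread pool illustrates the division of work rather than performance.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def merge_pair(pair):
    """Stand-in for the out-of-core merge of two sorted blocks."""
    left, right = pair
    return list(heapq.merge(left, right))


def aa_sort_scheme(data, block_size, threads=4):
    # Phase 1: divide the data into blocks of at most block_size elements.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Phase 2: each thread sorts independent blocks
        # (sorted() stands in for the in-core algorithm).
        runs = list(pool.map(sorted, blocks))
        # Phase 3: merge neighbouring sorted runs, stage by stage.
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(merge_pair, pairs))
            if len(runs) % 2:  # an odd run out is carried to the next stage
                merged.append(runs[-1])
            runs = merged
    return runs[0] if runs else []


print(aa_sort_scheme([9, 3, 7, 1, 8, 2, 6, 4, 5, 0], block_size=3))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In the later merge stages there are fewer pairs than threads, so some threads sit idle in this outline.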
Overall AA Sort Scheme Contd.
• Notation:
– Number of elements of data = N
– Number of elements per block = B
– Number of blocks = N/B

• In-core sorting phase:
– Computational time for the in-core sorting of each block is
proportional to B log(B)
– With N/B blocks and B fixed by the cache size, complexity of
in-core sorting = O(N)

• Out-of-core sorting phase:
– Merging the sorted blocks involves log(N/B) stages
– Computational complexity of each stage = O(N)
– Complexity of out-of-core sorting = O(N log(N))

• Hence, computational complexity of entire AA-sort = O(N log(N))
Overall AA Sort Scheme Contd.

An example of the entire AA-sort process,
where number of blocks (N/B) = 8 and the number of threads = 4

Experimental Setup
• PowerPC 970MP System
– Two 2.5 GHz dual-core processors
– 8GB system memory
– Each core had 1MB L2 cache memory
– Linux kernel 2.6.20

• System with Cell BE processors
– Two 2.4 GHz processors
– 1GB system memory
– Only SPE cores were used (16 SPE cores with
256KB local memory each)
– Linux kernel 2.6.15
Implementation
• Block size set to half of the size of the L2 cache
– 512KB (128K of 32 bit values) on PowerPC 970MP
– 128KB (32K of 32 bit values) on the SPE

• Shrink factor – 1.28
• Multiway merge technique with out-of-core
sorting
– 4 way merge
– Number of merging stages reduced from log2(N/B)
to log4(N/B)
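
As a worked example of the stage count, the snippet below counts merge stages for a 2-way and a 4-way merge. The block size B = 128K values is the PowerPC 970MP figure above; N = 32 x 2^20 values is an assumption made here to approximate the "32 million integers" data set, and the helper name `merge_stages` is illustrative.

```python
import math


def merge_stages(n, b, ways):
    """Number of stages needed to merge n / b sorted blocks, `ways` at a time."""
    blocks = n // b
    stages = 0
    while blocks > 1:
        blocks = math.ceil(blocks / ways)
        stages += 1
    return stages


N = 32 * 2**20   # assumed data size: 32 Mi 32-bit values
B = 128 * 1024   # 128K values per block (512KB)

print(N // B)                   # 256 blocks
print(merge_stages(N, B, 2))    # 8 stages = log2(256)
print(merge_stages(N, B, 4))    # 4 stages = log4(256)
```

Each stage passes over all N values once, so halving the number of stages roughly halves the memory traffic of the out-of-core phase.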
Effects of Using SIMD Instructions

Branch misprediction rate.

Acceleration by SIMD instructions for sorting 16 K random integers on one core of PowerPC 970MP.
Performance for 32 bit Integers

Performance of sequential version of each algorithm on a PowerPC
970MP core for sorting random 32-bit integers with various data sizes.
Performance for 32 bit Integers Contd.

Performance comparison on one PowerPC 970MP core for various input
datasets with 32 million integers.
Performance for 32 bit Integers Contd.

The execution time of parallel versions of AA-sort and GPUTeraSort on
up to 4 cores of PowerPC 970MP.
Performance for 32 bit Integers Contd.

Scalability with increasing number of cores on Cell BE for 32 million
integers
Conclusions
• Describes a new parallel sorting algorithm
called Aligned Access Sort
• The algorithm does not involve any unaligned
memory accesses
• Evaluated on PowerPC 970MP and Cell
Broadband Engine Processors
• Demonstrated better scalability and
performance in both sequential and parallel
versions
Conclusions Contd.
• Evaluation was performed only on 32 bit integers
• Performance comparison was performed on
a limited number of architectures
– Jatin Chhugani et al., "Efficient Implementation of Sorting on Multi-Core
SIMD CPU Architecture", Applications Research Lab, Corporate Technology
Group, Intel Corporation, August 2008, Auckland, New Zealand

• Does not discuss how multiple threads cooperate
on one merge operation when the number of blocks
becomes smaller than the number of threads
Thank You.

Aa sort-v4

  • 1. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors By: H.Inoue, T. Moriyama, H. Komatsu, T. Nakatami Presented By: M. Edirisinghe, H. Nawarathna
  • 2. Content • Introduction • SIMD instruction set • AA-sort algorithm • In-core algorithm • Out-of-core algorithm • Sorting scheme in AA-sort • Experimental results
  • 3. Introduction • High-performance processors provide multiple hardware threads within one physical processor with multiple cores and simultaneous multithreading • Many processors provide Single Instruction Multiple Data (SIMD) instructions 3
  • 4.
  • 5. SIMD Instructions • Advantages: – Data parallelism – Reduce the number of conditional branches in programs (can use vector compare and vector select instead) 5
  • 6. SIMD Instruction Set • Used Vector Multimedia eXtension (VMX or AltiVec) instructions • Provides a set of 128 bit vector registers – Use four 32 bit values • Useful VMX instructions for sorting: – Vector Compare – Vector Selected – Vector Permutation 6
  • 7. Sorting Algorithms and SIMD • Many sorting algorithms require unaligned or element wise memory access (Eg: quicksort) • It incur additional overhead and attenuate the benefits of SIMD instructions 7
  • 8. Paper’s Contribution • Propose Aligned-Access sort (AA-sort), a new parallel sorting algorithm suitable for exploiting both SIMD instructions and thread level parallelism available on today’s multi core processors with computational complexity of O(N log(N) 8
  • 9. AA-Sort Algorithm • Assumptions: – First element of the array to be sorted is aligned on a 128 bit boundary – Number of elements in the array, N, is a multiple of four 9
  • 10. AA-Sort Algorithm • Array of integer values a[N] is equivalent to an array of vector integers va[N/4] 10
  • 11. AA-Sort Algorithm • Consist of 2 algorithms: 1. In-core sorting algorithm 2. Out-of-core sorting algorithm • Phases of execution: –Divide all of the data into blocks that fit into the cache of the processor –Sort each block with the in-core sorting algorithm –Merge the sorted blocks with the out-of-core sorting algorithm 11
  • 12. Combsort • Extension to bubble sort (kill turtles-lower values in the end) • Compares and swaps non-adjacent elements • Improves performance • Computational complexity N log (N) average • Problems with SIMD instructions: – Unaligned memory access – Loop-carried dependencies 12
  • 14. In-Core Algorithm • Execution steps: 1. Sort values within each vector in ascending order 2. Execute combsort to sort the values into the transposed order
  • 15. In-Core Algorithm • Uses an extended combsort
  • 16. In-Core Algorithm 3. Reorder the values from the transposed order into the original order
  • 17. In-Core Algorithm • All 3 steps can be executed using SIMD instructions without unaligned memory access • Computational complexity is dominated by step 2 – Average O(N log N) – Worst case O(N^2) • Poor memory access locality – Performance degrades if the data cannot fit into the cache of the processor
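The three in-core steps can be sketched as follows. This is a scalar Python model under stated assumptions, not the authors' code: each vector is a list of four values, `cmpswap` and `cmpswap_skew` are hypothetical names for a lane-wise compare-and-swap and its wrap-around variant (lane j against lane j+1), and "transposed order" is taken to mean that reading lane 0 of every vector, then lane 1, and so on gives the sorted sequence.

```python
def cmpswap(A, B):
    """Lane-wise compare-and-swap: A keeps the minima, B the maxima."""
    swapped = False
    for j in range(4):
        if A[j] > B[j]:
            A[j], B[j] = B[j], A[j]
            swapped = True
    return swapped


def cmpswap_skew(A, B):
    """Wrap-around variant: compare lane j of A with lane j+1 of B."""
    swapped = False
    for j in range(3):
        if A[j] > B[j + 1]:
            A[j], B[j + 1] = B[j + 1], A[j]
            swapped = True
    return swapped


def in_core_sort(a, shrink=1.28):
    if len(a) % 4 != 0:
        raise ValueError("length must be a multiple of four")
    n = len(a) // 4
    if n == 0:
        return []
    # Step 1: sort the four values inside each vector.
    va = [sorted(a[i:i + 4]) for i in range(0, len(a), 4)]
    # Step 2: combsort over whole vectors, producing the transposed order.
    gap = int(n / shrink)
    while gap > 1:
        for i in range(n - gap):
            cmpswap(va[i], va[i + gap])
        for i in range(n - gap, n):  # pairs that wrap into the next lane
            cmpswap_skew(va[i], va[i + gap - n])
        gap = int(gap / shrink)
    swapped = True
    while swapped:  # gap-1 passes until nothing moves
        swapped = False
        for i in range(n - 1):
            swapped |= cmpswap(va[i], va[i + 1])
        swapped |= cmpswap_skew(va[n - 1], va[0])
    # Step 3: reorder from transposed order back to the original layout.
    return [va[i][j] for j in range(4) for i in range(n)]


print(in_core_sort([9, 3, 7, 1, 8, 2, 6, 4]))  # [1, 2, 3, 4, 6, 7, 8, 9]
```

Every comparison works on whole, aligned vectors, which is the property the slide highlights. A real implementation would presumably do step 3 with permute instructions rather than a gather like the list comprehension used here.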
  • 18. Out-of-Core Algorithm • Built on a merge of two sorted vectors – a = [a3:a2:a1:a0] and b = [b3:b2:b1:b0] are each sorted – c = [b:a] = vector_merge(a, b) holds all eight values in sorted order • [Figure: sorted a (a0 a1 a2 a3) and sorted b (b0 b1 b2 b3) pass through vector_merge to give sorted c0 to c7]
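  • 19. Dataflow of Merge • [Figure: dataflow of merging sorted inputs a0 to a3 and b0 to b3 through stages of compare (<) operations, with intermediate values min00, max00, min11, max11, min22, max22, min33, max33] • Requires lg(P) + 1 stages, where P is the number of elements in a vector • Here P = 4, so lg(P) + 1 = 3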
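One way to realise a three-stage merge of two sorted four-element vectors is Batcher's odd-even merge network, sketched below in scalar Python. Treating this network as the one drawn on the slide is an assumption; the comparator list `STAGES` and the function name follow the slide's `vector_merge` label but are otherwise illustrative.

```python
# Comparator pairs per stage; indices 0-3 are a0..a3 and 4-7 are b0..b3.
STAGES = [
    [(0, 4), (1, 5), (2, 6), (3, 7)],  # stage 1: a_i against b_i
    [(2, 4), (3, 5)],                  # stage 2
    [(1, 2), (3, 4), (5, 6)],          # stage 3
]


def vector_merge(a, b):
    """Merge two sorted 4-element vectors into (lower four, upper four)."""
    x = list(a) + list(b)
    for stage in STAGES:
        # Comparators within one stage touch disjoint positions,
        # so a SIMD version could evaluate them together.
        for i, j in stage:
            if x[i] > x[j]:
                x[i], x[j] = x[j], x[i]
    return x[:4], x[4:]


print(vector_merge([1, 3, 5, 7], [2, 4, 6, 8]))  # ([1, 2, 3, 4], [5, 6, 7, 8])
```

Each comparator is a min/max pair, so the network needs only compare, select and permute style operations and no data-dependent branches.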
  • 21. Out-of-Core Algorithm • No unaligned memory accesses • Better memory access locality than the in-core sorting algorithm – Higher performance when the data cannot fit in the cache
  • 22. Overall AA Sort Scheme • Divide all of the data to be sorted into blocks that fit in the cache or the local memory of the processor • Sort each block with the in-core sorting algorithm in parallel using multiple threads, where each thread processes an independent block • Merge the sorted blocks with the out-of-core sorting algorithm using multiple threads
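The overall scheme can be outlined in a few lines of Python. This is only a structural sketch: the built-in `sorted` stands in for the in-core algorithm, `heapq.merge` stands in for the out-of-core merge, and the function name and parameters are hypothetical. Python threads show where the parallelism sits, but CPython's global interpreter lock generally prevents a real speed-up for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor
from heapq import merge


def aa_sort_scheme(data, block_size, threads=4):
    """Divide into blocks, sort blocks in parallel, then merge in stages."""
    # Phase 1: divide the data into blocks of at most block_size elements.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Phase 2: each block is sorted independently (stand-in for in-core sort).
        runs = list(pool.map(sorted, blocks))
        # Phase 3: merge neighbouring runs, one stage at a time.
        while len(runs) > 1:
            pairs = [(runs[i], runs[i + 1]) for i in range(0, len(runs) - 1, 2)]
            merged = list(pool.map(lambda p: list(merge(*p)), pairs))
            if len(runs) % 2:
                merged.append(runs[-1])  # odd run out waits for the next stage
            runs = merged
    return runs[0] if runs else []


print(aa_sort_scheme([7, 2, 9, 4, 1, 8, 3, 6, 5, 0], block_size=3))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

As the stages progress there are fewer merges than threads, so some threads in this sketch simply go idle.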
  • 23. Overall AA Sort Scheme Contd. • Number of data elements = N; number of elements per block = B; number of blocks = N/B • In-core sorting phase: – Computational time for the in-core sorting of each block is proportional to B log(B) – With N/B blocks and B a constant set by the cache size, complexity of in-core sorting = O(N) • Out-of-core sorting phase: – Merging the sorted blocks involves log(N/B) stages – Computational complexity of each stage = O(N) – Complexity of out-of-core sorting = O(N log(N)) • Hence, computational complexity of the entire AA-sort = O(N log(N))
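Written out, the argument on this slide runs roughly as follows. The derivation assumes that B is a constant independent of N (it is fixed by the cache size) and that blocks are merged two at a time, which gives the base-2 logarithm; the base does not affect the asymptotic result.

```latex
\begin{align*}
T_{\text{in-core}} &= \frac{N}{B}\cdot O(B\log B) = O(N\log B) = O(N)
  && \text{($B$ constant)}\\
T_{\text{out-of-core}} &= \log_2\!\left(\frac{N}{B}\right)\cdot O(N)
  = O\!\left(N\log\frac{N}{B}\right) = O(N\log N)\\
T_{\text{AA-sort}} &= T_{\text{in-core}} + T_{\text{out-of-core}}
  = O(N) + O(N\log N) = O(N\log N)
\end{align*}
```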
  • 24. Overall AA Sort Scheme Contd. • [Figure: an example of the entire AA-sort process, where the number of blocks (N/B) = 8 and the number of threads = 4]
  • 25. Experimental Setup • PowerPC 970MP system – Two 2.5 GHz dual-core processors – 8 GB system memory – 1 MB L2 cache per core – Linux kernel 2.6.20 • System with Cell BE processors – Two 2.4 GHz processors – 1 GB system memory – Only the SPE cores were used (16 SPE cores with 256 KB local memory each) – Linux kernel 2.6.15
  • 26. Implementation • Block size is half the size of the L2 cache or local memory – 512 KB (128K 32-bit values) on the PowerPC 970MP – 128 KB (32K 32-bit values) on the SPE • Shrink factor for combsort – 1.28 • Multiway merge technique in out-of-core sorting – 4-way merge – Number of merging stages reduced from log2(N/B) to log4(N/B)
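To make the effect of the 4-way merge concrete, the sketch below counts merging stages for a hypothetical run of 32 × 2^20 integers with the 128K-value block size quoted for the PowerPC 970MP. The helper name `merge_stages` and the exact input size are assumptions for illustration.

```python
import math


def merge_stages(num_blocks, ways):
    """Stages needed to reduce num_blocks sorted blocks to one, merging `ways` at a time."""
    stages = 0
    while num_blocks > 1:
        num_blocks = math.ceil(num_blocks / ways)
        stages += 1
    return stages


n = 32 * 1024 * 1024   # hypothetical input size (32 Mi values)
b = 128 * 1024         # values per block on the PowerPC 970MP, per the slide
blocks = n // b        # 256 blocks

print(merge_stages(blocks, 2), merge_stages(blocks, 4))  # 8 4
```

Halving the number of stages halves the number of full passes over the data in the out-of-core phase.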
  • 27. Effects of Using SIMD Instructions • [Figure: branch misprediction rate] • [Figure: acceleration by SIMD instructions for sorting 16K random integers on one core of the PowerPC 970MP]
  • 28. Performance for 32-bit Integers • [Figure: performance of the sequential version of each algorithm on a PowerPC 970MP core for sorting random 32-bit integers with various data sizes]
  • 29. Performance for 32-bit Integers Contd. • [Figure: performance comparison on one PowerPC 970MP core for various input datasets with 32 million integers]
  • 30. Performance for 32-bit Integers Contd. • [Figure: execution time of the parallel versions of AA-sort and GPUTeraSort on up to 4 cores of the PowerPC 970MP]
  • 31. Performance for 32-bit Integers Contd. • [Figure: scalability with an increasing number of cores on the Cell BE for 32 million integers]
  • 32. Conclusions • The paper describes a new parallel sorting algorithm called Aligned-Access sort (AA-sort) • The algorithm does not involve any unaligned memory accesses • Evaluated on PowerPC 970MP and Cell Broadband Engine processors • Demonstrated better scalability and performance in both sequential and parallel versions
  • 33. Conclusions Contd. • Evaluation was performed only on 32-bit integers • Performance comparison was performed on a limited number of architectures – Jatin Chhugani et al., "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture", Applications Research Lab, Corporate Technology Group, Intel Corporation, August 2008, Auckland, New Zealand • Does not discuss how multiple threads cooperate on one merge operation when the number of blocks becomes smaller than the number of threads