Connected Component Labeling on Xeon Phi
Parallelization & Vectorization
Florian Wende
Zuse-Institute Berlin
wende@zib.de
ISC13, Leipzig
Connected Component Labeling
Suppose we are given the following image . . .
. . . and we are to assign unique labels to different connected regions!
. . . In parallel?
• Computer Vision: detect connected regions in images
• Computational Physics: cluster algorithms for the Ising model
• Percolation Theory
How to achieve the labeling? . . .
Connected Component Labeling - Strategy
1. Labeling algorithm
2. Parallelization
   a. Parallel implementation on CPU
   b. Run the CPU code on the Xeon Phi
   c. Adapt the code for the Xeon Phi
3. Vectorization (SIMD)
   d. Leave it to the compiler (auto-vectorization)
   e. SIMD intrinsic functions
Xeon Phi: 512-bit SIMD unit for 16 x 32-bit words
Connected Component Labeling - Algorithm
• Breadth/depth-first search algorithms, multi-pass algorithms
• Hoshen-Kopelman algorithm
• Cluster self-labeling algorithm by Coddington and Baillie
   1. Assign a unique label to each pixel of the image
   2. For each pixel, consider its adjacent connected pixels in the positive 1-, 2-, . . . direction, and set the labels of each such pair to the minimum of the two
   3. If the minimum operation is the identity function for all pixels (no label changed): Finished! Otherwise: Continue with step 2
CPU: Hoshen-Kopelman
Xeon Phi: Hoshen-Kopelman vs. Cluster self-labeling
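The three steps of the self-labeling scheme can be sketched in scalar C++ as follows. This is a minimal illustration, not the code used in the talk. It assumes a row-major image in which two adjacent pixels count as connected when they have the same color, and it takes the 1- and 2-directions to be x and y. The function name self_label is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Label connected regions of equal color in a w x h row-major image.
// Assumption: pixels are connected if they are direct neighbours in x or y
// and have the same color.
std::vector<std::uint32_t> self_label(const std::vector<int>& color, int w, int h) {
    assert(w > 0 && h > 0 && color.size() == static_cast<std::size_t>(w) * h);

    // Step 1: every pixel starts with a unique label (its array index).
    std::vector<std::uint32_t> label(color.size());
    for (std::size_t i = 0; i < label.size(); ++i)
        label[i] = static_cast<std::uint32_t>(i);

    bool changed = true;
    while (changed) {  // Step 3: repeat until a full sweep changes nothing.
        changed = false;
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                const int i = y * w + x;
                // Step 2: neighbours in positive 1-direction (x+1) and
                // positive 2-direction (y+1); -1 marks "no neighbour".
                const int nb[2] = {x + 1 < w ? i + 1 : -1, y + 1 < h ? i + w : -1};
                for (int j : nb) {
                    if (j < 0 || color[i] != color[j]) continue;
                    const std::uint32_t m = std::min(label[i], label[j]);
                    if (label[i] != m || label[j] != m) {
                        label[i] = label[j] = m;
                        changed = true;
                    }
                }
            }
        }
    }
    return label;
}
```

On convergence every pixel carries the smallest pixel index found in its region.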
Connected Comp. Labeling - Parallelization
Partition the image into equal-sized sub-images, and label them independently using multiple threads.
• Unique labels across different sub-images
• Connected regions that extend over multiple sub-images are merged after the labeling using atomic primitives
[Figure: image partitioned into eight sub-images, one per thread (Thread 0 to Thread 7)]
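A minimal sketch of this partition-then-merge scheme, under stated assumptions. The image is cut into horizontal strips rather than the tiles shown on the slide. Each strip is labeled with a scalar self-labeling loop. The merge uses a compare-and-swap on a parent array (a lock-free union that links the larger root to the smaller one), which is one possible realization of "atomic primitives" and not necessarily the one used in the talk. All function names are hypothetical. The pragmas are ignored when OpenMP is disabled, in which case the code runs serially; combining std::atomic with OpenMP threads works with common compilers but is an assumption here.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Label = std::uint32_t;
using AtomicLabels = std::vector<std::atomic<Label>>;

// Follow parent links until reaching a root (an entry that points to itself).
Label find_root(const AtomicLabels& parent, Label i) {
    for (Label p = parent[i].load(); p != i; p = parent[i].load()) i = p;
    return i;
}

// Merge the trees containing a and b: the larger root is linked to the smaller.
void merge(AtomicLabels& parent, Label a, Label b) {
    for (;;) {
        a = find_root(parent, a);
        b = find_root(parent, b);
        if (a == b) return;
        if (a < b) std::swap(a, b);  // now a > b
        Label expected = a;          // succeeds only if a is still a root
        if (parent[a].compare_exchange_strong(expected, b)) return;
    }
}

// Self-labeling restricted to rows [y0, y1); never touches other strips.
void label_strip(const std::vector<int>& color, AtomicLabels& parent,
                 int w, int y0, int y1) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int y = y0; y < y1; ++y)
            for (int x = 0; x < w; ++x) {
                const int i = y * w + x;
                const int nb[2] = {x + 1 < w ? i + 1 : -1, y + 1 < y1 ? i + w : -1};
                for (int j : nb) {
                    if (j < 0 || color[i] != color[j]) continue;
                    const Label m = std::min(parent[i].load(), parent[j].load());
                    if (parent[i].load() != m || parent[j].load() != m) {
                        parent[i].store(m);
                        parent[j].store(m);
                        changed = true;
                    }
                }
            }
    }
}

std::vector<Label> label_parallel(const std::vector<int>& color, int w, int h, int strips) {
    assert(w > 0 && h > 0 && strips > 0);
    AtomicLabels parent(color.size());
    // Global array indices are unique across all sub-images.
    for (std::size_t i = 0; i < parent.size(); ++i) parent[i].store(static_cast<Label>(i));
    const int rows = (h + strips - 1) / strips;  // rows per strip, last may be shorter

    #pragma omp parallel for
    for (int s = 0; s < strips; ++s)
        label_strip(color, parent, w, std::min(h, s * rows), std::min(h, (s + 1) * rows));

    // Merge regions that cross the boundary between strip s-1 and strip s.
    #pragma omp parallel for
    for (int s = 1; s < strips; ++s) {
        const int y = s * rows;  // first row of strip s
        if (y >= h) continue;
        for (int x = 0; x < w; ++x) {
            const int i = y * w + x;
            if (color[i] == color[i - w])
                merge(parent, static_cast<Label>(i), static_cast<Label>(i - w));
        }
    }

    // Resolve every pixel to the root of its merged region.
    std::vector<Label> label(color.size());
    #pragma omp parallel for
    for (int i = 0; i < w * h; ++i) label[i] = find_root(parent, static_cast<Label>(i));
    return label;
}
```

Because links always point from a larger index to a smaller one, the parent array cannot form cycles, and the final label of a region is its smallest pixel index regardless of how many strips are used.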
Connected Comp. Labeling - Vectorization
• Process multiple data simultaneously using SIMD instructions
Example: Self-labeling within sub-image of thread 2
1. Initialize labeling (array index)
2. Load row[0] into reg0, and create mask for adjacent entries in positive 1-direction: 1 if equal-colored, 0 otherwise
3. Overlap each element in reg0 with its adjacent element in positive 1-direction, and write the result to reg1
4. Determine the pairwise minimum of the entries in reg0 and reg1 using the mask, and write the result to reg1
5. Write back entries in reg1 to row[0] using the mask
6. Shift all elements in reg1 one position in positive 1-direction, shifting in the 0-th element, and write the result to reg1
7. Shift all bits in mask one position up, and write the pairwise minimum of the entries in row[0] and reg1 to row[0] using the shifted mask
8. Did labels change?
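Steps 1 to 8 can be emulated lane by lane in portable C++. The sketch below is not the intrinsics code from the talk: a std::array of 16 values stands in for a 512-bit register, a 16-bit integer stands in for the mask, and the names Vec, Mask, neighbour_mask, sweep_row and label_row are hypothetical. It assumes the row is exactly 16 elements wide and that the last lane has no neighbour in positive 1-direction.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;                      // 512 bit / 32 bit
using Vec = std::array<std::uint32_t, kLanes>;  // stands in for one SIMD register
using Mask = std::uint16_t;                     // one bit per lane

// Step 2 (mask): bit i is set if lanes i and i+1 have the same color.
Mask neighbour_mask(const Vec& color) {
    Mask m = 0;
    for (int i = 0; i + 1 < kLanes; ++i)
        if (color[i] == color[i + 1]) m = static_cast<Mask>(m | (1u << i));
    return m;
}

// One sweep over a row in 1-direction; returns true if any label changed.
bool sweep_row(Vec& row, Mask mask) {
    const Vec reg0 = row;  // step 2: load row into reg0
    Vec reg1;
    // Step 3: lane i receives lane i+1 (lane 15 wraps, but its mask bit is never set).
    for (int i = 0; i < kLanes; ++i) reg1[i] = reg0[(i + 1) % kLanes];
    // Step 4: masked pairwise minimum of reg0 and reg1, written to reg1.
    for (int i = 0; i < kLanes; ++i)
        if ((mask >> i) & 1) reg1[i] = std::min(reg0[i], reg1[i]);
    // Step 5: masked write-back of reg1 to the row.
    for (int i = 0; i < kLanes; ++i)
        if ((mask >> i) & 1) row[i] = reg1[i];
    // Step 6: shift reg1 by one lane in positive 1-direction; lane 0 keeps its value.
    for (int i = kLanes - 1; i > 0; --i) reg1[i] = reg1[i - 1];
    // Step 7: shift the mask up by one bit and apply a masked minimum to the row.
    const Mask shifted = static_cast<Mask>(mask << 1);
    for (int i = 0; i < kLanes; ++i)
        if ((shifted >> i) & 1) row[i] = std::min(row[i], reg1[i]);
    return row != reg0;  // step 8: did labels change?
}

// Steps 1 and 8: initialize labels and repeat the sweep until nothing changes.
Vec label_row(const Vec& color) {
    Vec row;
    for (int i = 0; i < kLanes; ++i) row[i] = static_cast<std::uint32_t>(i);
    const Mask mask = neighbour_mask(color);
    while (sweep_row(row, mask)) {
    }
    return row;
}
```

After one sweep each lane holds the minimum over itself and its connected left and right neighbours, so a run of n equal-colored lanes needs on the order of n sweeps to converge.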
Result of the operations up to now: adjacent connected elements in row[0] are each set to their pairwise minimum value.
[Figure: row[0] before and after the update in 1-direction]
Repeat the procedure for the 2-direction.
Repeat the procedure for all other rows as long as labels change . . .
[Figure: sub-image labels before and after]
Now: Merge labels across different sub-images using atomics!
Finished!
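The slides give no details for the 2-direction. One plausible reading, sketched here as an assumption, is a lane-wise masked minimum between two vertically adjacent rows: neighbours share the same lane, so no element shift is needed. As before, a std::array stands in for a SIMD register, and the name sweep_rows_2dir is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 16;
using Vec = std::array<std::uint32_t, kLanes>;  // stands in for one SIMD register

// Update the labels of two vertically adjacent rows a (upper) and b (lower).
// A lane is active (mask bit 1) if both pixels have the same color.
// Returns true if any label changed.
bool sweep_rows_2dir(const Vec& color_a, const Vec& color_b, Vec& label_a, Vec& label_b) {
    bool changed = false;
    for (int i = 0; i < kLanes; ++i) {
        if (color_a[i] != color_b[i]) continue;  // mask bit 0: leave lane untouched
        const std::uint32_t m = std::min(label_a[i], label_b[i]);
        changed = changed || label_a[i] != m || label_b[i] != m;
        label_a[i] = label_b[i] = m;             // both pixels take the minimum
    }
    return changed;
}
```

Alternating this update with the 1-direction sweep over all rows of a sub-image, until neither reports a change, corresponds to the loop described on the slide.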
Connected Comp. Labeling - Benchmark
CPU: Intel Xeon E5-2670, 8 Cores + 2-way Hyper-Threading @ 2.6GHz
• Hoshen-Kopelman algorithm + Atomics for label merging
• Vectorization was left to the compiler: there are no masked SIMD intrinsics!
Xeon Phi: 60 Cores + 4-way Hyper-Threading @ 1.1GHz
• Hoshen-Kopelman vs. Cluster self-labeling + Atomics for label merging
• Vectorization by means of _mm512_[mask]_XXX() intrinsics
Parallelization by means of OpenMP: #pragma omp parallel {...}
Programming effort: approx. 2-3 days for the CPU code (incl. optimization), less than 1 day for the Xeon Phi code (based on CPU code)
Connected Comp. Labeling - Benchmark
CPU: Intel Xeon E5-2670, 8 Cores + 2-way Hyper-Threading @ 2.6GHz
Xeon Phi: 60 Cores + 4-way Hyper-Threading @ 1.1GHz
Application: Swendsen-Wang cluster algorithm for the 2D Ising model
Acknowledgement
Dr. Thomas Steinke, Zuse-Institute Berlin (ZIB)
Dr. Michael Klemm, Intel GmbH, Germany
Work partially funded by BMBF Grant No. 01IH11004G
References
[1] C. F. Baillie and P. D. Coddington. Cluster Identification Algorithms for Spin Models – Sequential and Parallel, 1991.
[2] J. Hoshen and R. Kopelman. Percolation and Cluster Distribution. I. Cluster Multiple Labeling Technique and Critical Concentration Algorithm. Phys. Rev. B, 14:3438–3445, 1976.
[3] R. H. Swendsen and J.-S. Wang. Nonuniversal Critical Dynamics in Monte Carlo Simulations. Phys. Rev. Lett., 58:86–88, Jan 1987.
[4] Intel Corp. Intel Xeon Phi Coprocessor 5110P, Product Brief, 2012.
