Medical image processing strategies for multi-core CPUs Daniel Blezek, Mayo Clinic blezek.daniel@mayo.edu
Poll: Does your primary computer have more than one core? Have you ever written parallel code?
It’s a parallel world... SMP was formerly the domain of researchers. Thanks to Intel, now it’s everywhere! ... but most of us think in serial ...
Developers are not trained
Development of parallel software is difficult
Outside the box
Erlang
Scala
...shoehorn
Parallel Computing – according to Google
  “parallel computing”: 1.4M hits on Google
  “multithreading”: 10M hits
  “multicore”: 2.4M hits
  “parallel programming”: 1.1M hits
  Why is it so hard?
    The world is parallel
    We all think in parallel
    Yet we are taught to program in serial
Degrees of parallelism (my take)
  Serial – SISD: single thread of execution
  Data parallel – SIMD (fine-grained parallelism)
  Embarrassingly parallel – larger-scale SIMD
    CT or MR reconstruction
    Each operation is independent, e.g. iFFT of slices
  Worker thread – e.g. virus scanning software
  Coarse-grained parallelism – SMP or MIMD
    Focus of this presentation, more in GPU talk
    Concurrency, OpenMP, TBB, pthreads/WinThreads
  Large scale – MPI on cluster, tight coupling
  Large scale – Grid computing, loose coupling
Pragmatic approach
  C/C++ and Fortran are the kings of performance (I’ve never written a single line of Fortran, so don’t ask)
  “Bolted on” parallel concepts
  Zero language support
  Huge existing codebase
Pragmatic approach
  Briefly touch on SIMD
  Introduce SMP concepts
    Threads, concurrency
  Development models
    pthreads/WinThreads
    OpenMP
    TBB
    ITK
  Medical image processing example problems
  Common errors
  Next steps
SIMD
SIMD – basic principles (figure source: http://en.wikipedia.org/wiki/SIMD)
Data structures for SIMD: Array of Structures
struct Vec {
  float x, y, z;
};
Vec* points = new Vec[sz];
// Figure: each point is packed into a register as X Y Z --, operated on (*), then unpacked back to X Y Z --
Data structures for SIMD: Structure of Arrays
struct Vec {
  float* x;
  float* y;
  float* z;
  Vec ( int sz ) {
    x = new float[sz];
    y = new float[sz];
    z = new float[sz];
  }
};
Structure of Arrays (packed 4-float vectors)
struct Vec {
  Vector4f* v;
  Vec ( int sz ) {
    // must be word aligned
    v = new Vector4f[sz];
  }
};
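To make the two layouts concrete, here is a minimal C++ sketch that applies the same scaling operation to each. The names `PointAoS`, `PointsSoA`, `scaleAoS` and `scaleSoA` are made up for this illustration and are not from the slides. Whether a compiler actually vectorizes either loop depends on the compiler and its flags.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures: x, y, z of one point sit next to each other in memory.
struct PointAoS {
  float x, y, z;
};

// Structure of Arrays: all x values are contiguous, then all y, then all z.
struct PointsSoA {
  std::vector<float> x, y, z;
  explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Scale every point in place; each iteration touches one whole struct.
void scaleAoS(std::vector<PointAoS>& pts, float s) {
  for (std::size_t i = 0; i < pts.size(); ++i) {
    pts[i].x *= s;
    pts[i].y *= s;
    pts[i].z *= s;
  }
}

// Same operation on the SoA layout: three simple unit-stride loops,
// the shape that auto-vectorizers tend to handle most readily.
void scaleSoA(PointsSoA& pts, float s) {
  for (std::size_t i = 0; i < pts.x.size(); ++i) pts.x[i] *= s;
  for (std::size_t i = 0; i < pts.y.size(); ++i) pts.y[i] *= s;
  for (std::size_t i = 0; i < pts.z.size(); ++i) pts.z[i] *= s;
}
```

Both functions compute the same result; only the memory access pattern differs.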
SIMD pitfalls
  Structure alignment
    Usually needs to be aligned on a word boundary
  Structure considerations
    May need to refactor existing code/structures
  Generally not cross-platform
    MMX, 3DNow!, SSE, SSE2, SSE4, AltiVec, AVX, etc.
  Performance gains are modest
    2x – 4x common
  Limited instructions
    Add, multiply, divide, round
    Not suitable for branching logic
  Autovectorizing compilers for simple loops
    -ftree-vectorize (GCC); -fast, -O2 or -O3 (Intel Compiler)
Threads
Threads – they’re everywhere
SMP concepts
  Useful to think in terms of “cores”
    2 dual-core CPUs = 4 “cores”
  Cores share main memory, may share cache
  Threads in the same process share memory
  Generally, one executing thread per core
    Other threads sleeping
Cores – they’re everywhere
  How many cores does your laptop have?
  Mine has 50(!)
    2 Intel CPU cores (Core 2 Duo)
    32 nVidia cores (9600M GT)
    16 nVidia cores (9400M)
Parallel concepts for SMP
  Process
    Started by the OS
    Single thread executes “main”
    No direct access to memory of other processes
  Threads
    Stream of execution under a process
    Access to memory in containing process
    Private memory
    Lifetime may be less than main thread
  Concurrency
    Coordination between threads
    High level (mutex, locks, barriers)
    Low level (atomic operations)
Processes & Threads (figure: process and thread diagram)
Thread construction – pthread example
#include <pthread.h>
// Thread work function: must take and return a pointer to void
void *doWork ( void *work ) {
  // Do work
  return work; // equivalent to pthread_exit ( work );
}
...
pthread_t child;
...
rc = pthread_create ( &child, &attr, doWork, (void *)work );
...
rc = pthread_join ( child, &threadwork );
...
Thread construction – Win32 example
#include <windows.h>
DWORD WINAPI doWork ( LPVOID work ) { /* Do work */ return 0; }
...
PMYDATA work;
DWORD   childID;
HANDLE  child;
child = CreateThread (
        NULL,         // default security attributes
        0,            // use default stack size
        doWork,       // thread function name
        work,         // argument to thread function
        0,            // use default creation flags
        &childID );   // returns the thread identifier
// Wait for the single child; use WaitForMultipleObjects with an array of handles for several threads
WaitForSingleObject ( child, INFINITE );
Thread construction – Java example
import java.lang.Thread;
class Worker implements Runnable {
  public Worker ( Work work ) {}
  public void run() {} // Do work here
}
...
Worker worker = new Worker ( someWork );
new Thread ( worker ).start();
Race Conditions (figure: serial access vs. parallel access; the parallel case is the problem)
Mutex
  Mutex – mutual exclusion lock
    Protects a section of code
    Only one thread has a lock on the object
    Threads may wait for the mutex, or return a status if the mutex is locked
  Semaphore
    N threads
  Critical section
    One thread executes code
    Protects global resources
    Maintain consistent state
Race Conditions
...
N = 0;
...
// Start some threads
...
void* doWork() {
  N++; // get, incr, store
}
Solution w/Mutex
Mutex mutex;
void* doWork() {
  mutex.lock();
  N++; // only one thread at a time
  mutex.release();
}
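The mutex fix can be sketched as a small runnable program. The slides use a generic `Mutex` class; the version below swaps in C++11 `std::thread` and `std::mutex` instead, and the function name `countWithMutex` is invented for this example.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Increment a shared counter from several threads, guarding it with a mutex.
int countWithMutex(int numThreads, int incrementsPerThread) {
  int n = 0;          // shared state
  std::mutex nMutex;  // protects n
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&n, &nMutex, incrementsPerThread]() {
      for (int i = 0; i < incrementsPerThread; ++i) {
        // The lock is released when `lock` goes out of scope.
        std::lock_guard<std::mutex> lock(nMutex);
        ++n;  // get, increment, store now happen as one unit
      }
    });
  }
  for (std::thread& w : workers) w.join();
  return n;
}
```

Without the `lock_guard`, the final count would typically fall short of `numThreads * incrementsPerThread` because increments from different threads overwrite each other.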
Atomic operations
  Locks are not perfect
    Cause blocking
    Relatively heavy-weight
  Atomic operations
    Simple operations
    Hardware support
    Can implement w/Mutex
  Conditions
    Invisibility – no other thread sees the change until it is complete
    Atomicity – if the operation fails, return to the original state
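As an illustration of lock-free updates, the sketch below uses C++11 `std::atomic`, which is not the API shown in the slides. `countWithAtomic` and `atomicMax` are hypothetical helper names. The second function shows the common compare-and-swap retry loop for an operation the hardware does not provide directly.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Same counter as the mutex version, but each increment is one atomic operation.
int countWithAtomic(int numThreads, int incrementsPerThread) {
  std::atomic<int> n(0);
  std::vector<std::thread> workers;
  for (int t = 0; t < numThreads; ++t) {
    workers.emplace_back([&n, incrementsPerThread]() {
      for (int i = 0; i < incrementsPerThread; ++i) {
        n.fetch_add(1);  // indivisible read-modify-write, no lock held
      }
    });
  }
  for (std::thread& w : workers) w.join();
  return n.load();
}

// Raise maxSeen to value if value is larger, retrying if another thread
// changed maxSeen in between (compare-and-swap loop).
void atomicMax(std::atomic<int>& maxSeen, int value) {
  int current = maxSeen.load();
  while (current < value &&
         !maxSeen.compare_exchange_weak(current, value)) {
    // On failure compare_exchange_weak stores the latest value in `current`,
    // so the loop condition is re-evaluated against fresh data.
  }
}
```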
Deadlock (figure: threads contending for Mutex A and Mutex B)
Thread synchronization – barrier
  Initialized with the number of threads expected
  Threads signal when they are ready
  Wait until all expected threads are there
  A stalled or dead thread can stall all the threads
Thread synchronization – condition variables
  Workers atomically release mutex and wait
  Master atomically releases mutex and signals
  Workers wake up and acquire mutex
  (figure: worker and master threads alternating between Mutex A, working, waiting on the condition, and signalling)
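The wait/signal handshake can be sketched with C++11 `std::condition_variable`, chosen here for portability and not taken from the slides. The function `sumWithWorker`, the queue, and the `done` flag are all assumptions made for the example: a master pushes integers and one worker sums them.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Master pushes 1..numItems; a single worker waits for work and sums it.
int sumWithWorker(int numItems) {
  std::mutex m;
  std::condition_variable workReady;
  std::queue<int> work;
  bool done = false;
  int total = 0;

  std::thread worker([&]() {
    std::unique_lock<std::mutex> lock(m);
    for (;;) {
      // wait() atomically releases the mutex and sleeps; on wake-up it
      // re-acquires the mutex before the predicate is checked.
      workReady.wait(lock, [&]() { return !work.empty() || done; });
      while (!work.empty()) {
        total += work.front();
        work.pop();
      }
      if (done) break;
    }
  });

  for (int i = 1; i <= numItems; ++i) {
    {
      std::lock_guard<std::mutex> lock(m);
      work.push(i);
    }  // mutex released before signalling
    workReady.notify_one();
  }
  {
    std::lock_guard<std::mutex> lock(m);
    done = true;
  }
  workReady.notify_one();
  worker.join();
  return total;
}
```

The predicate passed to `wait` guards against spurious wake-ups and against a signal that arrives before the worker starts waiting.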
Thread pool & Futures
  Maintains a “pool” of worker threads
  Work queued until a thread is available
  Optionally notify through a “Future”
    Future can query status, holds return value
  Thread returns to pool, no startup overhead
  Core concept for OpenMP and TBB
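The "future" half of this idea can be sketched with C++11 `std::async` and `std::future`. Note the assumption: `std::async` is not guaranteed to reuse pooled threads, so this only illustrates how a future holds a result that is collected later. `doSomeWork` is a hypothetical stand-in for real work.

```cpp
#include <future>
#include <vector>

// Hypothetical unit of work: square the input.
int doSomeWork(int i) { return i * i; }

// Launch n tasks, keep their futures, then gather the results.
int sumOfSquaresWithFutures(int n) {
  std::vector<std::future<int>> results;
  for (int i = 0; i < n; ++i) {
    results.push_back(std::async(std::launch::async, doSomeWork, i));
  }
  int total = 0;
  for (std::future<int>& f : results) {
    total += f.get();  // blocks until that task has finished
  }
  return total;
}
```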
OpenMP
Introduction to OpenMP
  Scatter / gather paradigm
  Maintains a thread pool
  Requires compiler support
    Visual C++, GCC 4.2 or later, Intel Compiler
  Easy to adapt existing serial code, easy to debug
  Simple paradigm
OpenMP – simple parallel sections
#pragma omp parallel sections num_threads ( 5 )
{
  // 5 threads scatter here
  #pragma omp section
  { /* Do task 1 */ }
  #pragma omp section
  { /* Do task 2 */ }
  ...
  #pragma omp section
  { /* Do task N */ }
  // Implicit barrier
}
OpenMP – parallel for
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  doSomeWork( i );
}
// Implicit barrier
(figure: scheduling the iterations)
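How iterations are handed to threads can be controlled with the `schedule` clause. The sketch below assumes a deliberately uneven workload (`triangular` is a made-up function whose cost grows with `i`), where dynamic scheduling is often a reasonable choice. The code is valid C++ with or without OpenMP enabled (for example `-fopenmp` with GCC); without it the pragma is ignored and the loop runs serially.

```cpp
#include <vector>

// Hypothetical uneven workload: iteration i costs O(i).
long triangular(int i) {
  long s = 0;
  for (int k = 1; k <= i; ++k) s += k;
  return s;
}

std::vector<long> computeAll(int n) {
  std::vector<long> out(n);
  // Chunks of 4 iterations are handed to whichever thread is free next,
  // instead of splitting the range evenly up front (static scheduling).
  #pragma omp parallel for schedule(dynamic, 4)
  for (int i = 0; i < n; ++i) {
    out[i] = triangular(i);  // each iteration writes its own element
  }
  return out;
}
```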
OpenMP – reduction
int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i & TotalAmountOfWork
  TotalAmountOfWork += doSomeWork( i );
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// Each thread has local copy, barrier does reduction
// No need to use critical sections
OpenMP – “atomic” reduction
int TotalAmountOfWork = 0;
#pragma omp parallel for
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  int myWork = doSomeWork( i );
  #pragma omp atomic
  TotalAmountOfWork += myWork;
}
// Implicit barrier
// TotalAmountOfWork was properly accumulated
// However, the atomic section can cause thread stalls
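The reduction and atomic variants can be compared side by side in a small program. `doSomeWork` below is a hypothetical stand-in that returns `i % 7`. Both functions should produce the same total; the difference is only in how often threads synchronize. The code also builds without OpenMP, in which case the pragmas are ignored and the loops run serially.

```cpp
// Hypothetical stand-in for real work.
int doSomeWork(int i) { return i % 7; }

// Each thread accumulates into a private copy; copies are combined at the end.
int totalWithReduction(int n) {
  int total = 0;
  #pragma omp parallel for reduction(+ : total)
  for (int i = 0; i < n; ++i) {
    total += doSomeWork(i);
  }
  return total;
}

// Every update to the shared total is a separate atomic operation.
int totalWithAtomic(int n) {
  int total = 0;
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    int myWork = doSomeWork(i);  // private to the iteration
    #pragma omp atomic
    total += myWork;
  }
  return total;
}
```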
OpenMP – critical
int TotalAmountOfWork = 0;
#pragma omp parallel for reduction ( + : TotalAmountOfWork )
for ( int i = 0; i < NumberOfIterations; i++ ) {
  // Threads scatter here
  // each thread has a private copy of i
  TotalAmountOfWork += doSomeWork( i );
  #pragma omp critical
  {
    // Executed by one thread at a time, e.g., “Mutex lock”
    criticalOperation();
  }
}
// Implicit barrier
OpenMP – single
int TotalAmountOfWork = 0;
#pragma omp parallel
{
  #pragma omp single nowait
  {
    // Executed by one thread, use “master” for the main thread
    reportProgress ( 0 );
  } // !! No implicit barrier because of “nowait” clause !!
  #pragma omp for reduction ( + : TotalAmountOfWork )
  for ( int i = 0; i < NumberOfIterations; i++ ) {
    // Threads scatter here
    // each thread has a private copy of i
    TotalAmountOfWork += doSomeWork( i );
  }
  // Implicit barrier
}
// NB: single is a worksharing construct, so it may not be nested directly inside the body of a parallel for loop
Threading Building Blocks (TBB)
Introduction to TBB
  Commercial and Open Source licenses
    GPL with runtime exception
  Cross-platform C++ library
    Similar to STL
    Usual concurrency classes
  Several different constructs for threading
    for, do, reduction, pipeline
  Finer control over scheduling
  Maintains a thread pool to execute tasks
  http://www.threadingbuildingblocks.org/
TBB – parallel for
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
class Worker {
 public:
  Worker ( /* ... */ ) { ... }
  void operator() ( const tbb::blocked_range<int>& r ) const {
    for ( int i = r.begin(); i != r.end(); ++i ) {
      doWork ( i );
    }
  }
};
...
tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
                    Worker ( /* ... */ ), tbb::auto_partitioner() );
TBB – parallel reduction
#include "tbb/blocked_range.h"
#include "tbb/parallel_reduce.h"
class ReducingWorker {
  int mLocalWork;
 public:
  ReducingWorker ( /* ... */ ) { ... }
  // Splitting constructor: each split starts with its own empty accumulator
  ReducingWorker ( ReducingWorker& o, tbb::split ) : mLocalWork ( 0 ) {}
  // Merge the result of another worker into this one
  void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; }
  void operator() ( const tbb::blocked_range<int>& r ) { ... }
  int getLocalWork() const { return mLocalWork; }
};
...
ReducingWorker w;
tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
                       w, tbb::auto_partitioner() );
w.getLocalWork();
TBB – parallel reduction (figure)
TBB – synchronization
#include <tbb/spin_mutex.h>
tbb::spin_mutex MyMutex;
void doWork ( /* ... */ ) {
  // Enter critical section, exit when lock goes out of scope
  tbb::spin_mutex::scoped_lock lock ( MyMutex );
  // NB: This is an error!!! No named lock object holds MyMutex for the scope
  // tbb::spin_mutex::scoped_lock ( MyMutex );
}
...
#include <tbb/atomic.h>
tbb::atomic<int> MyCounter;
...
MyCounter = 0;      // Atomic
int i = MyCounter;  // Atomic
MyCounter++; MyCounter--; ++MyCounter; --MyCounter;  // Atomic
...
MyCounter = 0; MyCounter += 2;  // Watch out for other threads! Each statement is atomic, the pair is not
ITK Model
ITK Implementation
  Threads operate across slices
    Only implemented behavior in ITK
  itk::MultiThreader is somewhat flexible
    Requires that you break the ITK model
  Uses thread join, higher overhead
  No thread pool
Comparison
  Language specific (Java): + Fine-grain control; + Cross-platform easy(?); + Many constructs; +/- Language-specific
  Threads (C/C++): + Fine-grain control; Few constructs
  ITK: + Integrated; + Simple; +/- ITK only
  TBB: +/- More complex; + Fine-grain control; + Intel (-?); + Open Source; + Some constructs
  OpenMP: + Simple; + Adapt existing code; +/- Industry standard; +/- Compiler support
Medical Imaging
Image class
class Image {
  public:
    short* mData;
    int mWidth, mHeight, mDepth;
    int mVoxelsPerSlice;
    int mVoxelsPerVolume;
    short** mSlicePointers; // Pointers to the start of each slice
    short getVoxel ( int x, int y, int z ) {...}
    void setVoxel ( int x, int y, int z, short v ) {...}
};
Trivial problem – threshold
  Threshold an image
    If intensity > 100, output 1, otherwise output 0
  Present from simple to complex
    OpenMP
    TBB
    ITK
    pthread (see extra slides)
Threshold – OpenMP #1
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int z = 0; z < in->mDepth; z++ ) {
    for ( int y = 0; y < in->mHeight; y++ ) {
      for ( int x = 0; x < in->mWidth; x++ ) {
        if ( in->getVoxel(x,y,z) > 100 ) {
          out->setVoxel(x,y,z,1);
        } else {
          out->setVoxel(x,y,z,0);
        }
      }
    }
  }
}
// NB: can loop over slices, rows or columns by moving
// pragma, but must choose at compile time
Threshold – OpenMP #2
void doThreshold ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) {
    if ( in->mData[s] > 100 ) {
      out->mData[s] = 1;
    } else {
      out->mData[s] = 0;
    }
  }
}
// Likely a lot faster than previous code
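A self-contained version of the flat-loop threshold might look like the following. `Volume` is a simplified, hypothetical stand-in for the `Image` class from the slides (it owns its data in a `std::vector`), and the threshold level is passed as a parameter here instead of being hard-coded to 100.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal volume type; the slides' Image class has more members.
struct Volume {
  int width, height, depth;
  std::vector<short> data;
  Volume(int w, int h, int d)
      : width(w), height(h), depth(d),
        data(static_cast<std::size_t>(w) * h * d, 0) {}
  int voxelsPerVolume() const { return static_cast<int>(data.size()); }
};

// Write 1 where the input exceeds level, 0 elsewhere.
void doThreshold(const Volume& in, Volume& out, short level) {
  const int n = in.voxelsPerVolume();
  // Every voxel is independent, so a single flat loop can be split freely.
  #pragma omp parallel for
  for (int s = 0; s < n; ++s) {
    out.data[s] = (in.data[s] > level) ? 1 : 0;
  }
}
```

The caller is expected to pass an output volume with the same dimensions as the input.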
Threshold – TBB #1
class Threshold {
    Image* in;
    Image* out;
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {
      for ( int x = r.begin(); x != r.end(); ++x ) {
        if ( in->mData[x] > 100 ) {
          out->mData[x] = 1;
        } else {
          out->mData[x] = 0;
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range<int>( 0, in->mVoxelsPerVolume ),
    Threshold ( in, out ), auto_partitioner() );
// NB: default “grain size” for blocked_range is 1 pixel
// tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )
Threshold – TBB #2
class Threshold {
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {...}
    void operator() ( const tbb::blocked_range2d<int,int>& r ) const {
      // Each 2D tile of rows and columns is processed through every slice
      for ( int z = 0; z < in->mDepth; z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            if ( in->getVoxel(x,y,z) > 100 ) {
              out->setVoxel(x,y,z,1);
            } else {
              out->setVoxel(x,y,z,0);
            }
          }
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32,
                                              0, in->mWidth,  32 ),
    Threshold ( in, out ), auto_partitioner() );
Threshold – TBB #3
class Threshold {
  public:
    Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...}
    void operator() ( const tbb::blocked_range<int>& r ) const {...}
    void operator() ( const tbb::blocked_range2d<int,int>& r ) const {...}
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            if ( in->getVoxel(x,y,z) > 100 ) {
              out->setVoxel(x,y,z,1);
            } else {
              out->setVoxel(x,y,z,0);
            }
          }
        }
      }
    }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,  1,
                                                  0, in->mHeight, 32,
                                                  0, in->mWidth,  32 ),
    Threshold ( in, out ), auto_partitioner() );
Threshold – ITK solution
ThreadedGenerateData ( const OutputImageRegionType& out, int threadId ) {
  ...
  // Define the iterators
  ImageRegionConstIterator<TIn> inputIt ( inputPtr, out );
  ImageRegionIterator<TOut> outputIt ( outputPtr, out );
  inputIt.GoToBegin();
  outputIt.GoToBegin();
  while ( !inputIt.IsAtEnd() ) {
    if ( inputIt.Get() > 100 ) {
      outputIt.Set ( 1 );
    } else {
      outputIt.Set ( 0 );
    }
    ++inputIt;
    ++outputIt;
  }
}
Interesting problem – anisotropic diffusion
  Edge-preserving smoothing method
    Perona and Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence (1990) vol. 12 (7) pp. 629-639
  Iterative process
  Demonstrate
    OpenMP
    TBB
    (ITK has an implementation)
    (pthreads are tedious at the very least)
  Pop quiz – are the following correct?
Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  #pragma omp parallel for
  for ( int t = 0; t < TotalTime; t++ ) {
    for ( int z = 0; z < in->mDepth; z++ ) {
      ...
    }
  }
}
Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  short *previousSlice, *slice, *nextSlice;
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      previousSlice = in->mSlicePointers[z-1];
      slice = in->mSlicePointers[z];
      nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – OpenMP
void doAD ( Image* in, Image* out ) {
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int z = 1; z < in->mDepth-1; z++ ) {
      short* previousSlice = in->mSlicePointers[z-1];
      short* slice = in->mSlicePointers[z];
      short* nextSlice = in->mSlicePointers[z+1];
      for ( int y = 1; y < in->mHeight-1; y++ ) {
        short* previousRow = slice + (y-1) * in->mWidth;
        short* row = slice + y * in->mWidth;
        short* nextRow = slice + (y+1) * in->mWidth;
        short* aboveRow = previousSlice + y * in->mWidth;
        short* belowRow = nextSlice + y * in->mWidth;
        for ( int x = 1; x < in->mWidth-1; x++ ) {
          dx = 2 * row[x] - row[x-1] - row[x+1];
          dy = 2 * row[x] - previousRow[x] - nextRow[x];
          dz = 2 * row[x] - aboveRow[x] - belowRow[x];
          ...
Anisotropic diffusion – TBB #1
class doAD {
  public:
    static ADConstants* sConstants;
    doAD ( Image* in, Image* out ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      if ( sConstants == NULL ) { initConstants(); }
      // process
      ...
    }
};
Anisotropic diffusion – TBB #2
class doAD {
  public:
    doAD ( ... ) {...}
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
    }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth,
                                                  0, in->mHeight,
                                                  0, in->mWidth ),
    doAD ( in, out ), auto_partitioner() );
Anisotropic diffusion – TBB #3
class doAD {
  public:
    static tbb::atomic<int> sProgress;
    tbb::spin_mutex mMutex;
    doAD ( ... ) {...}
    void reportProgress ( int p ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        tbb::spin_mutex::scoped_lock lock ( mMutex );
        sProgress++;
        reportProgress ( sProgress );
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
    }
};
...
doAD::sProgress = 0;
parallel_for (...);
Anisotropic diffusion – TBB #4
class doAD {
  public:
    static tbb::atomic<int> sProgress;
    static tbb::spin_mutex mMutex;
    doAD ( ... ) {...}
    void reportProgress ( int p ) { ... }
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) {
      for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) {
        tbb::spin_mutex::scoped_lock lock ( mMutex );
        sProgress++;
        reportProgress ( sProgress );
        for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) {
          for ( int x = r.cols().begin(); x != r.cols().end(); x++ ) {
            ...
    }
};
...
doAD::sProgress = 0;
parallel_for (...);
Anisotropic diffusion – OpenMP (Progress)
using namespace std;
void doAD ( Image* in, Image* out ) {
  int progress = 0;
  for ( int t = 0; t < TotalTime; t++ ) {
    #pragma omp parallel for
    for ( int s = 0; s < in->mDepth; s++ ) {
      #pragma omp atomic
      progress++;
      #pragma omp single
      reportProgress ( progress );
      ...
    }
  }
}
(slide callout: nowait)
Real-life problem
  Compute Frangi’s vesselness measure
    Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956
  Memory-constrained solution
    ITK implementation requires 1.2G for 100M volume
    Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007)
  Possible solutions using OpenMP, TBB
Vesselness
ITK Implementation – computing the Hessian
  6 volumes computed in serial
  Individual filters are threaded
  Good CPU usage
  High memory requirements
Design considerations
  Break problem into blocks
    Compute Hessian, eigenvalues, and vesselness
  Reduces memory requirements
  Incurs overhead, boundary conditions
Design considerations: keep CPUs full
Design considerations – boundary condition
Trade-offs
Algorithm sketch – Serial
int BlockSize = 32;
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
Algorithm sketch – OpenMP
int BlockSize = 32;
#pragma omp parallel for
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
  Each thread is on a different slice
  May cause cache contention
  Similar problems for “y” direction
Algorithm sketch – OpenMP
int BlockSize = 32;
for ( int z = 0; z < in->mDepth; z += BlockSize ) {
  for ( int y = 0; y < in->mHeight; y += BlockSize ) {
    #pragma omp parallel for
    for ( int x = 0; x < in->mWidth; x += BlockSize ) {
      processBlock ( in, out, x, y, z, BlockSize );
    }
  }
}
  All threads on same rows
  May not utilize all CPUs if the ratio of Width to BlockSize < # CPUs
  Better cache utilization
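A third option, not shown in the slides, is to enumerate all blocks first and parallelize over the flat list, so the number of parallel work items does not depend on any single dimension. The sketch below is an assumption-laden illustration: `Block`, `makeBlocks` and `processAllBlocks` are invented names, edge blocks are clipped to the volume, and counting voxels stands in for the real `processBlock` call.

```cpp
#include <algorithm>
#include <vector>

// Origin and clipped size of one block.
struct Block {
  int x, y, z;
  int sx, sy, sz;
};

// Enumerate blocks covering a w*h*d volume; blocks at the edges are
// shortened so they never extend past the volume.
std::vector<Block> makeBlocks(int w, int h, int d, int blockSize) {
  std::vector<Block> blocks;
  for (int z = 0; z < d; z += blockSize) {
    for (int y = 0; y < h; y += blockSize) {
      for (int x = 0; x < w; x += blockSize) {
        Block b = {x, y, z,
                   std::min(blockSize, w - x),
                   std::min(blockSize, h - y),
                   std::min(blockSize, d - z)};
        blocks.push_back(b);
      }
    }
  }
  return blocks;
}

// Process every block in parallel; here the "processing" just counts voxels.
long processAllBlocks(const std::vector<Block>& blocks) {
  long voxels = 0;
  const int n = static_cast<int>(blocks.size());
  #pragma omp parallel for reduction(+ : voxels) schedule(dynamic)
  for (int i = 0; i < n; ++i) {
    const Block& b = blocks[i];
    voxels += static_cast<long>(b.sx) * b.sy * b.sz;  // stand-in for processBlock
  }
  return voxels;
}
```

For a 70 x 64 x 33 volume with 32-voxel blocks this yields 12 blocks that together cover every voxel exactly once.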
Algorithm sketch – TBB
class Vesselness {
  public:
    void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const {
      // Process the block, could use ITK here
      processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(),
                     r.cols().size(),  r.rows().size(),  r.pages().size() );
    }
};
...
parallel_for ( tbb::blocked_range3d<int,int,int>(
                 0, in->mDepth,  32,
                 0, in->mHeight, 32,
                 0, in->mWidth,  32 ),
    Vesselness ( in, out ), auto_partitioner() );
  Individual blocks
  Full CPUs
  May not have best cache performance
Next steps
  Go try parallel development
    Try threads to gain understanding and insight
    Next OpenMP, adapting existing code
    TBB: more constructs, different approaches
  Experiment with new languages
    Erlang, Scala, Reia, Chapel, X10, Fortress...
  Check out some of the resources provided
  Have fun! It’s a brave new world out there...
Resources
  TBB (http://www.threadingbuildingblocks.org/)
  OpenMP (http://openmp.org/wp/)
  Books/Articles
    Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/)
    Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/)
    ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf)
    The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf)
  Tutorials
    Parallel Programming (https://computing.llnl.gov/tutorials/parallel_comp/)
    pthreads (https://computing.llnl.gov/tutorials/pthreads/)
    OpenMP (https://computing.llnl.gov/tutorials/openMP/)
  Other
    LLNL (https://computing.llnl.gov/)
    Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language)
    GCC-OpenMP (http://gcc.gnu.org/projects/gomp/)
    Intel Compiler (http://software.intel.com/en-us/intel-compilers/)
Resources
  Languages
    Erlang (http://www.erlang.org/)
    Scala (http://www.scala-lang.org/)
    Chapel (http://chapel.cs.washington.edu/)
    X10 (http://x10-lang.org/)
    Unified Parallel C (http://upc.gwu.edu/)
    Titanium (http://titanium.cs.berkeley.edu/)
    Co-Array Fortran (http://www.co-array.org/)
    ZPL (http://www.cs.washington.edu/research/zpl/home/index.html)
    High Performance Fortran (http://hpff.rice.edu/)
    Fortress (http://projectfortress.sun.com/Projects/Community/)
    Others (http://www.google.com/search?q=parallel+programming+language)
Thread construction – pthread example
#include <pthread.h>
int pthread_create ( pthread_t *restrict thread,
                     const pthread_attr_t *restrict attr,
                     void *(*start_routine)(void *),
                     void *restrict arg );
void pthread_exit ( void *value_ptr );
int pthread_join ( pthread_t thread, void **value_ptr );
Mutex – pthread example
#include <pthread.h>
pthread_mutex_t myMutex;
...
pthread_mutex_init ( &myMutex, NULL );
...
pthread_mutex_lock ( &myMutex );
// Critical section, only one thread at a time
...
pthread_mutex_unlock ( &myMutex );
...
// pthread_mutex_trylock returns 0 on success and EBUSY if the mutex is already locked
if ( pthread_mutex_trylock ( &myMutex ) == 0 ) {
  // We did get the lock, so we are in the critical section
  ...
  pthread_mutex_unlock ( &myMutex );
}
Mutex – Java example
class Foo {
  public synchronized int doWork () {
    // only one thread at a time can execute doWork on a given Foo
    ...
  }
  Object resource = new Object();
  public int otherWork () {
    synchronized ( resource ) {
      // critical section, resource is the mutex
      ...
    }
  }
}

Contenu connexe

Tendances

JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?Doug Hawkins
 
Lock? We don't need no stinkin' locks!
Lock? We don't need no stinkin' locks!Lock? We don't need no stinkin' locks!
Lock? We don't need no stinkin' locks!Michael Barker
 
Why GC is eating all my CPU?
Why GC is eating all my CPU?Why GC is eating all my CPU?
Why GC is eating all my CPU?Roman Elizarov
 
Parallelization using open mp
Parallelization using open mpParallelization using open mp
Parallelization using open mpranjit banshpal
 
Objective-C Blocks and Grand Central Dispatch
Objective-C Blocks and Grand Central DispatchObjective-C Blocks and Grand Central Dispatch
Objective-C Blocks and Grand Central DispatchMatteo Battaglio
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersDhanashree Prasad
 
java memory management & gc
java memory management & gcjava memory management & gc
java memory management & gcexsuns
 
TensorFlow Lite (r1.5) & Android 8.1 Neural Network API
TensorFlow Lite (r1.5) & Android 8.1 Neural Network APITensorFlow Lite (r1.5) & Android 8.1 Neural Network API
TensorFlow Lite (r1.5) & Android 8.1 Neural Network APIMr. Vengineer
 
Fast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaFast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaCharles Nutter
 
JVM for Dummies - OSCON 2011
JVM for Dummies - OSCON 2011JVM for Dummies - OSCON 2011
JVM for Dummies - OSCON 2011Charles Nutter
 
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe ShockwaveHES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe ShockwaveHackito Ergo Sum
 
Øredev 2011 - JVM JIT for Dummies (What the JVM Does With Your Bytecode When ...
Øredev 2011 - JVM JIT for Dummies (What the JVM Does With Your Bytecode When ...Øredev 2011 - JVM JIT for Dummies (What the JVM Does With Your Bytecode When ...
Øredev 2011 - JVM JIT for Dummies (What the JVM Does With Your Bytecode When ...Charles Nutter
 
Know yourengines velocity2011
Know yourengines velocity2011Know yourengines velocity2011
Know yourengines velocity2011Demis Bellot
 

Tendances (20)

JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?JVM Mechanics: When Does the JVM JIT & Deoptimize?
JVM Mechanics: When Does the JVM JIT & Deoptimize?
 
Lock? We don't need no stinkin' locks!
Lock? We don't need no stinkin' locks!Lock? We don't need no stinkin' locks!
Lock? We don't need no stinkin' locks!
 
FreeRTOS Course - Semaphore/Mutex Management
FreeRTOS Course - Semaphore/Mutex ManagementFreeRTOS Course - Semaphore/Mutex Management
FreeRTOS Course - Semaphore/Mutex Management
 
Cp7 rpc
Cp7 rpcCp7 rpc
Cp7 rpc
 
concurrency_c#_public
concurrency_c#_publicconcurrency_c#_public
concurrency_c#_public
 
Why GC is eating all my CPU?
Why GC is eating all my CPU?Why GC is eating all my CPU?
Why GC is eating all my CPU?
 
OpenMP And C++
OpenMP And C++OpenMP And C++
OpenMP And C++
 
C++11 talk
C++11 talkC++11 talk
C++11 talk
 
Parallelization using open mp
Parallelization using open mpParallelization using open mp
Parallelization using open mp
 
MPI n OpenMP
MPI n OpenMPMPI n OpenMP
MPI n OpenMP
 
Objective-C Blocks and Grand Central Dispatch
Objective-C Blocks and Grand Central DispatchObjective-C Blocks and Grand Central Dispatch
Objective-C Blocks and Grand Central Dispatch
 
OpenMP Tutorial for Beginners
OpenMP Tutorial for BeginnersOpenMP Tutorial for Beginners
OpenMP Tutorial for Beginners
 
java memory management & gc
java memory management & gcjava memory management & gc
java memory management & gc
 
TensorFlow Lite (r1.5) & Android 8.1 Neural Network API
TensorFlow Lite (r1.5) & Android 8.1 Neural Network APITensorFlow Lite (r1.5) & Android 8.1 Neural Network API
TensorFlow Lite (r1.5) & Android 8.1 Neural Network API
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Fast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible JavaFast as C: How to Write Really Terrible Java
Fast as C: How to Write Really Terrible Java
 
JVM for Dummies - OSCON 2011
JVM for Dummies - OSCON 2011JVM for Dummies - OSCON 2011
JVM for Dummies - OSCON 2011
 
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe ShockwaveHES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
HES2011 - Aaron Portnoy and Logan Brown - Black Box Auditing Adobe Shockwave
 
Øredev 2011 - JVM JIT for Dummies (What the JVM Does With Your Bytecode When ...
Øredev 2011 - JVM JIT for Dummies (What the JVM Does With Your Bytecode When ...Øredev 2011 - JVM JIT for Dummies (What the JVM Does With Your Bytecode When ...
Øredev 2011 - JVM JIT for Dummies (What the JVM Does With Your Bytecode When ...
 
Know yourengines velocity2011
Know yourengines velocity2011Know yourengines velocity2011
Know yourengines velocity2011
 

Medical Image Processing Strategies for multi-core CPUs

  • 1. Medical image processing strategies for multi-core CPUs Daniel Blezek, Mayo Clinic blezek.daniel@mayo.edu
  • 2. Poll: Does your primary computer have more than one core? Have you ever written parallel code?
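  • 3. It’s a parallel world... SMP was formerly the domain of researchers. Thanks to Intel, now it’s everywhere! ... but most of us think in serial ...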
  • 5. Development of parallel software is difficult
  • 10. Parallel computing, according to Google: “parallel computing” 1.4M hits; “multithreading” 10M hits; “multicore” 2.4M hits; “parallel programming” 1.1M hits. Why is it so hard? The world is parallel and we all think in parallel, yet we are taught to program in serial.
  • 11. Degrees of parallelism (my take): Serial (SISD), a single thread of execution. Data parallel (SIMD), fine-grained parallelism. Embarrassingly parallel, larger-scale SIMD: CT or MR reconstruction, where each operation is independent, e.g. iFFT of slices. Worker thread, e.g. virus scanning software. Coarse-grained parallelism (SMP or MIMD): the focus of this presentation, more in the GPU talk; concurrency, OpenMP, TBB, pthreads/WinThreads. Large scale: MPI on a cluster, tight coupling. Large scale: grid computing, loose coupling.
  • 12. Pragmatic approach: C/C++ and Fortran are the kings of performance (I’ve never written a single line of Fortran, so don’t ask). Parallel concepts are “bolted on”, with zero language support. Huge existing codebase.
  • 13. Pragmatic approach: briefly touch on SIMD; introduce SMP concepts (threads, concurrency); development models (pthreads/WinThreads, OpenMP, TBB, ITK); medical image processing: example problems, common errors; next steps.
  • 15. SIMD – basic principles (http://en.wikipedia.org/wiki/SIMD)
  • 16. Data structures for SIMD – Array of Structures
      struct Vec {
        float x, y, z;
      };
      Vec* points = new Vec[sz];
      (Figure: each X Y Z -- element is packed into a SIMD register, operated on, then unpacked.)
  • 17. Data structures for SIMD – Structure of Arrays
      struct Vec {
        float* x;
        float* y;
        float* z;
        Vec ( int sz ) {
          x = new float[sz];
          y = new float[sz];
          z = new float[sz];
        }
      };
      Structure of Arrays
      struct Vec {
        Vector4f* v;
        Vec ( int sz ) {
          // must be word aligned
          v = new Vector4f[sz];
        }
      };
  • 18. SIMD pitfalls: Structure alignment: usually needs to be aligned on a word boundary. Structure considerations: may need to refactor existing code/structures. Generally not cross-platform: MMX, 3DNow!, SSE, SSE2, SSE4, AltiVec, AVX, etc. Performance gains are modest: 2x – 4x is common. Limited instructions: add, multiply, divide, round; not suitable for branching logic. Autovectorizing compilers handle simple loops: -ftree-vectorize (GCC); -fast, -O2 or -O3 (Intel Compiler).
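As a rough illustration of the "simple loops" an autovectorizer can work with, here is a minimal sketch of the Structure of Arrays layout in C++. The names `Points` and `scalePoints` are made up for this example, and whether a given compiler actually emits SIMD instructions depends on its version, flags, and target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure of Arrays: each coordinate lives in its own contiguous array.
// (Hypothetical type for illustration; not from the slides.)
struct Points {
    std::vector<float> x, y, z;
    explicit Points(std::size_t n) : x(n), y(n), z(n) {}
};

// Scale every point by s. Each loop is a branch-free pass over contiguous
// floats, the kind of simple loop an autovectorizer can typically handle.
void scalePoints(Points& p, float s) {
    const std::size_t n = p.x.size();
    assert(p.y.size() == n && p.z.size() == n);
    for (std::size_t i = 0; i < n; ++i) p.x[i] *= s;
    for (std::size_t i = 0; i < n; ++i) p.y[i] *= s;
    for (std::size_t i = 0; i < n; ++i) p.z[i] *= s;
}
```

With the Array of Structures layout the same operation would stride over interleaved x, y, z values, which is what forces the pack/unpack step.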
  • 20. Threads – they’re everywhere
  • 21. SMP concepts: Useful to think in terms of “cores”: 2 dual-core CPUs = 4 “cores”. Cores share main memory and may share cache. Threads in the same process share memory. Generally, one executing thread per core; other threads are sleeping.
  • 22. Cores – they’re everywhere: How many cores does your laptop have? Mine has 50(!): 2 Intel CPU cores (Core 2 Duo), 32 nVidia cores (9600M GT), 16 nVidia cores (9400M).
  • 23. Parallel concepts for SMP: Process: started by the OS; a single thread executes “main”; no direct access to the memory of other processes. Threads: a stream of execution under a process; access to memory in the containing process; private memory; lifetime may be less than that of the main thread. Concurrency: coordination between threads; high level (mutex, locks, barriers); low level (atomic operations).
  • 24. Processes & Threads (figure: a process and its threads)
  • 25. Thread construction – pthread example
      #include <pthread.h>
      // Thread work function, must return pointer to void
      void *doWork(void *work) {
        // Do work
        return work; // equivalent to pthread_exit ( work );
      }
      ...
      pthread_t child;
      ...
      rc = pthread_create(&child, &attr, doWork, (void *)work);
      ...
      rc = pthread_join ( child, &threadwork );
      ...
  • 26. Thread construction – Win32 example
      #include <windows.h>
      DWORD WINAPI doWork( LPVOID work ) { /* Do work */ return 0; }
      ...
      PMYDATA work;
      DWORD childID;
      HANDLE child;
      child = CreateThread(
        NULL,      // default security attributes
        0,         // use default stack size
        doWork,    // thread function name
        work,      // argument to thread function
        0,         // use default creation flags
        &childID); // returns the thread identifier
      // Wait for the single child thread to finish
      WaitForSingleObject(child, INFINITE);
  • 27. Thread construction – Java example
      import java.lang.Thread;
      class Worker implements Runnable {
        public Worker ( Work work ) {}
        public void run() { /* Do work here */ }
      }
      ...
      Worker worker = new Worker ( someWork );
      new Thread ( worker ).start();
  • 28. Race Conditions (figure panels: Serial, Parallel, Problem!)
  • 29. Mutex: a mutual exclusion lock that protects a section of code. Only one thread has a lock on the object; other threads may wait for the mutex, or return a status if the mutex is locked. Semaphore: admits N threads. Critical section: one thread executes the code; protects global resources; maintains consistent state.
  • 30. Race Conditions
      ...
      N = 0;
      ...
      // Start some threads
      ...
      void* doWork() {
        N++; // get, incr, store
      }
      Solution w/Mutex:
      Mutex mutex;
      void* doWork() {
        mutex.lock();
        N++; // get, incr, store
        mutex.release();
      }
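The mutex solution can be sketched with standard C++11 threads. This is only an illustration: `countWithMutex` is a hypothetical helper, and `std::mutex` with `std::lock_guard` stands in for the generic `Mutex` class on the slide.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Increment a shared counter from several threads and return the final value.
// Without the lock, the get/increment/store steps of ++n could interleave
// and updates would be lost.
int countWithMutex(int numThreads, int incrementsPerThread) {
    assert(numThreads > 0 && incrementsPerThread >= 0);
    int n = 0;
    std::mutex m;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                std::lock_guard<std::mutex> lock(m); // released at end of scope
                ++n;
            }
        });
    }
    for (auto& th : threads) th.join();
    return n;
}
```

Calling `countWithMutex(4, 10000)` should return 40000 on every run, whereas the unprotected version may return less.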
  • 31. Atomic operations: Locks are not perfect: they cause blocking and are relatively heavy-weight. Atomic operations: simple operations with hardware support; can be implemented w/Mutex. Conditions: Invisibility – no other thread knows about the change. Atomicity – if the operation fails, return to the original state.
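For comparison with a lock, the same kind of shared counter can be kept in a `std::atomic<int>`. This is a minimal sketch assuming C++11; `countWithAtomic` is a hypothetical name chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Increment a shared counter from several threads without a mutex.
// fetch_add is a single indivisible read-modify-write, so no update is lost.
int countWithAtomic(int numThreads, int incrementsPerThread) {
    assert(numThreads > 0 && incrementsPerThread >= 0);
    std::atomic<int> n{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < incrementsPerThread; ++i) {
                // Relaxed ordering suffices here: only the count matters,
                // and join() below makes the final value visible.
                n.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) th.join();
    return n.load();
}
```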
  • 32. Deadlock (figure: threads contending for Mutex A and Mutex B)
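One common way to avoid a two-mutex deadlock is to acquire both locks in a single step. The sketch below assumes C++11 and uses `std::lock`, which locks both mutexes without deadlocking regardless of argument order; `Account`, `transfer` and `totalAfterOpposingTransfers` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical resource guarded by its own mutex.
struct Account {
    std::mutex m;
    int balance;
    explicit Account(int b) : balance(b) {}
};

// Locking from.m and then to.m separately could deadlock when two threads
// transfer in opposite directions. std::lock acquires both together.
void transfer(Account& from, Account& to, int amount) {
    assert(&from != &to);
    std::lock(from.m, to.m);
    std::lock_guard<std::mutex> lockFrom(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> lockTo(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

// Two threads move money in opposite directions; the total is conserved.
int totalAfterOpposingTransfers(int rounds) {
    Account a(1000), b(1000);
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

An alternative with the same effect is to always lock the mutexes in a fixed global order.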
  • 33. Thread synchronization – barrier: Initialized with the number of threads expected. Threads signal when they are ready, then wait until all expected threads are there. A stalled or dead thread can stall all the threads.
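A barrier with this behavior can be sketched from a mutex and a condition variable. This is an illustrative C++11 implementation under the assumption that no library barrier is available; `Barrier`, `arriveAndWait` and `everyThreadSeesAllWrites` are names invented for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Reusable barrier: initialized with the number of threads expected.
class Barrier {
public:
    explicit Barrier(int expected) : mExpected(expected) { assert(expected > 0); }

    // Signal arrival, then block until all expected threads have arrived.
    void arriveAndWait() {
        std::unique_lock<std::mutex> lock(mMutex);
        const unsigned generation = mGeneration;
        if (++mWaiting == mExpected) {
            mWaiting = 0;      // last arrival resets the barrier for reuse
            ++mGeneration;
            mCond.notify_all();
        } else {
            mCond.wait(lock, [&] { return mGeneration != generation; });
        }
    }

private:
    std::mutex mMutex;
    std::condition_variable mCond;
    const int mExpected;
    int mWaiting = 0;
    unsigned mGeneration = 0;
};

// Phase 1: each thread writes its own slot. Phase 2 (after the barrier):
// each thread reads every slot and must see all phase-1 writes.
bool everyThreadSeesAllWrites(int numThreads) {
    Barrier barrier(numThreads);
    std::vector<int> slots(numThreads, 0);
    std::vector<int> seen(numThreads, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&, t] {
            slots[t] = 1;
            barrier.arriveAndWait();
            for (int v : slots) seen[t] += v;
        });
    }
    for (auto& th : threads) th.join();
    for (int s : seen) if (s != numThreads) return false;
    return true;
}
```

If one of the threads never calls `arriveAndWait()`, the others block forever, which is the stall the slide warns about.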
  • 34. Thread synchronization – condition variables: Workers atomically release the mutex and wait. The master atomically releases the mutex and signals. Workers wake up and acquire the mutex. (Figure: threads alternating between holding Mutex A, working, and waiting on the condition.)
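The master/worker handshake can be sketched with `std::condition_variable` (C++11). The queue of integers and the name `sumWithWorkers` are assumptions made for the example; the slide does not specify what the workers wait for.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <vector>

// The master pushes the integers 1..numItems; workers wait on the condition
// variable, pop items, and accumulate them. Returns the sum over all workers.
int sumWithWorkers(int numWorkers, int numItems) {
    assert(numWorkers > 0 && numItems >= 0);
    std::mutex m;
    std::condition_variable cv;
    std::queue<int> items;
    bool done = false;
    std::vector<int> partial(numWorkers, 0);

    std::vector<std::thread> workers;
    for (int w = 0; w < numWorkers; ++w) {
        workers.emplace_back([&, w] {
            for (;;) {
                std::unique_lock<std::mutex> lock(m);
                // wait() atomically releases the mutex and sleeps; on wake-up
                // it re-acquires the mutex before checking the predicate.
                cv.wait(lock, [&] { return done || !items.empty(); });
                if (items.empty()) return; // done and queue drained
                const int item = items.front();
                items.pop();
                lock.unlock();             // do the "work" outside the lock
                partial[w] += item;
            }
        });
    }

    for (int i = 1; i <= numItems; ++i) {
        { std::lock_guard<std::mutex> lock(m); items.push(i); }
        cv.notify_one();                   // master signals one waiting worker
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_all();

    for (auto& th : workers) th.join();
    return std::accumulate(partial.begin(), partial.end(), 0);
}
```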
  • 35. Thread pool & Futures: Maintains a “pool” of worker threads. Work is queued until a thread is available. Optionally notify through a “Future”: the Future can query status and holds the return value. The thread returns to the pool, so there is no startup overhead. Core concept for OpenMP and TBB.
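The Future idea can be illustrated with `std::async` and `std::future` from C++11. Note the assumption: `std::async` is not required to use a thread pool, so this sketch only shows how a future holds a return value, not pool management. `parallelSum` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Split data into up to numTasks chunks, sum each chunk asynchronously,
// and combine the partial sums through their futures.
long parallelSum(const std::vector<int>& data, int numTasks) {
    assert(numTasks > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + numTasks - 1) / numTasks;
    std::vector<std::future<long>> futures;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        futures.push_back(std::async(std::launch::async, [&data, begin, end] {
            return std::accumulate(data.begin() + begin, data.begin() + end, 0L);
        }));
    }
    long total = 0;
    for (auto& f : futures) total += f.get(); // get() blocks until the value is ready
    return total;
}
```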
  • 37. Introduction to OpenMP: Scatter / gather paradigm. Maintains a thread pool. Requires compiler support: Visual C++, GCC 4.2 or later, Intel Compiler. Easy to adapt existing serial code, easy to debug. Simple paradigm.
  • 38. OpenMP – simple parallel sections
      #pragma omp parallel sections num_threads ( 5 )
      {
        // 5 Threads scatter here
        #pragma omp section
        {
          // Do task 1
        }
        #pragma omp section
        {
          // Do task 2
        }
        ...
        #pragma omp section
        {
          // Do task N
        }
        // Implicit barrier
      }
  • 39. OpenMP – parallel for
      #pragma omp parallel for
      for ( int i = 0; i < NumberOfIterations; i++ ) {
        // Threads scatter here
        // each thread has a private copy of i
        doSomeWork( i );
      }
      // Implicit barrier
      (Figure: scheduling the iterations)
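How iterations are handed out to threads can be controlled with the `schedule` clause. The sketch below is an assumption about what a scheduling example might look like, not code from the slides; `triangularNumbers` is a made-up function whose later iterations cost more than earlier ones. Compiled without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// out[i] = 0 + 1 + ... + i, computed by a loop so that the cost of an
// iteration grows with i (deliberately uneven work).
std::vector<long> triangularNumbers(int n) {
    assert(n >= 0);
    std::vector<long> out(n);
    // schedule(dynamic, 8): idle threads grab the next chunk of 8 iterations,
    // which tends to balance uneven work better than equal static blocks.
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < n; ++i) {
        long sum = 0;
        for (int k = 0; k <= i; ++k) sum += k;
        out[i] = sum; // each iteration writes a distinct element
    }
    return out;
}
```

For uniform work per iteration, the default static schedule usually has lower overhead.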
  • 40. OpenMP – reduction
      int TotalAmountOfWork = 0;
      #pragma omp parallel for reduction ( + : TotalAmountOfWork )
      for ( int i = 0; i < NumberOfIterations; i++ ) {
        // Threads scatter here
        // each thread has a private copy of i & TotalAmountOfWork
        TotalAmountOfWork += doSomeWork( i );
      }
      // Implicit barrier
      // TotalAmountOfWork was properly accumulated
      // Each thread has local copy, barrier does reduction
      // No need to use critical sections
  • 41. OpenMP – “atomic” reduction
      int TotalAmountOfWork = 0;
      #pragma omp parallel for
      for ( int i = 0; i < NumberOfIterations; i++ ) {
        // Threads scatter here
        int myWork = doSomeWork( i );
        #pragma omp atomic
        TotalAmountOfWork += myWork;
      }
      // Implicit barrier
      // TotalAmountOfWork was properly accumulated
      // However, the atomic section can cause thread stalls
  • 42. OpenMP – critical
      int TotalAmountOfWork = 0;
      #pragma omp parallel for reduction ( + : TotalAmountOfWork )
      for ( int i = 0; i < NumberOfIterations; i++ ) {
        // Threads scatter here
        // each thread has a private copy of i
        TotalAmountOfWork += doSomeWork( i );
        #pragma omp critical
        {
          // Execute by one thread at a time, e.g., “Mutex lock”
          criticalOperation();
        }
      }
      // Implicit barrier
  • 43. OpenMP – single
      int TotalAmountOfWork = 0;
      #pragma omp parallel for reduction ( + : TotalAmountOfWork )
      for ( int i = 0; i < NumberOfIterations; i++ ) {
        // Threads scatter here
        // each thread has a private copy of i
        TotalAmountOfWork += doSomeWork( i );
        // NB: OpenMP does not allow “single” (or “master”) closely nested
        // inside a “for” region, so compilers may reject this as written
        #pragma omp single nowait
        {
          // Execute by one thread, use “master” for the main thread
          reportProgress ( TotalAmountOfWork );
        }
        // !! No implicit barrier because of “nowait” clause !!
      }
      // Implicit barrier
  • 45. Introduction to TBB: Commercial and open source licenses (GPL with runtime exception). Cross-platform C++ library, similar to STL. Usual concurrency classes. Several different constructs for threading: for, do, reduction, pipeline. Finer control over scheduling. Maintains a thread pool to execute tasks. http://www.threadingbuildingblocks.org/
  • 46. TBB – parallel for
      #include "tbb/blocked_range.h"
      #include "tbb/parallel_for.h"
      class Worker {
      public:
        Worker ( /* ... */ ) {...};
        void operator() ( const tbb::blocked_range<int>& r ) const {
          for ( int i = r.begin(); i != r.end(); ++i ) {
            doWork ( i );
          }
        }
      };
      ...
      tbb::parallel_for ( tbb::blocked_range<int> ( 0, N ),
                          Worker ( /* ... */ ),
                          tbb::auto_partitioner() );
  • 47. TBB – parallel reduction
      #include "tbb/blocked_range.h"
      #include "tbb/parallel_reduce.h"
      class ReducingWorker {
        int mLocalWork;
      public:
        ReducingWorker ( /* ... */ ) {...};
        // Splitting constructor: a new worker starts with an empty local sum
        ReducingWorker ( ReducingWorker& o, tbb::split ) : mLocalWork(0) {};
        // Merge the result of another worker into this one
        void join ( const ReducingWorker& o ) { mLocalWork += o.mLocalWork; };
        void operator() ( const tbb::blocked_range<int>& r ) { ... }
      };
      ...
      ReducingWorker w;
      tbb::parallel_reduce ( tbb::blocked_range<int> ( 0, N ),
                             w,
                             tbb::auto_partitioner() );
      w.getLocalWork();
  • 48. TBB – parallel reduction
  • 49. TBB – synchronization
      tbb::spin_mutex MyMutex;
      void doWork ( /* ... */ ) {
        // Enter critical section, exit when lock goes out of scope
        tbb::spin_mutex::scoped_lock lock ( MyMutex );
        // NB: This is an error!!!
        // tbb::spin_mutex::scoped_lock( MyMutex );
      }
      ...
      #include <tbb/atomic.h>
      tbb::atomic<int> MyCounter;
      ...
      MyCounter = 0; // Atomic
      int i = MyCounter; // Atomic
      MyCounter++; MyCounter--; ++MyCounter; --MyCounter; // Atomic
      ...
      MyCounter = 0; MyCounter += 2; // Watch out for other threads!
  • 51. ITK Implementation: Threads operate across slices; this is the only implemented behavior in ITK. itk::MultiThreader is somewhat flexible, but requires that you break the ITK model. Uses thread join, which has higher overhead. No thread pool.
  • 52.
  • 53.
  • 55. Image class
      class Image {
      public:
        short* mData;
        int mWidth, mHeight, mDepth;
        int mVoxelsPerSlice;
        int mVoxelsPerVolume;
        short** mSlicePointers; // Pointers to the start of each slice
        short getVoxel ( int x, int y, int z ) {...}
        void setVoxel ( int x, int y, int z, short v ) {...}
      };
  • 56. Trivial problem – threshold: Threshold an image: if intensity > 100, output 1, otherwise output 0. Presented from simple to complex: OpenMP, TBB, ITK, pthread (see extra slides).
  • 57. Threshold – OpenMP #1 void doThreshold ( Image* in, Image* out ) { #pragma omp parallel for for ( int z = 0; z < in->mDepth; z++ ) { for ( int y = 0; y < in->mHeight; y++ ) { for ( int x = 0; x < in->mWidth; x++ ) { if ( in->getVoxel(x,y,z) > 100 ) { out->setVoxel(x,y,z,1); } else { out->setVoxel(x,y,z,0); } } } } } // NB: can loop over slices, rows or columns by moving // pragma, but must choose at compile time
  • 58. Threshold – OpenMP #2 void doThreshold ( Image* in, Image* out ) { #pragma omp parallel for for ( int s = 0; s < in->mVoxelsPerVolume; s++ ) { if ( in->mData[s] > 100 ) { out->mData[s] = 1; } else { out->mData[s] = 0; } } } // Likely a lot faster than previous code
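The flat-loop variant can be tried on its own with a plain buffer. The sketch below assumes a `std::vector<short>` in place of the Image class and a hypothetical function name `thresholdFlat`; the threshold level is passed in rather than fixed at 100. If the compiler is not run with OpenMP enabled, the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <cassert>
#include <vector>

// Threshold a flat voxel buffer: 1 where the input exceeds `level`, else 0.
// Each iteration writes a distinct element, so no synchronization is needed.
void thresholdFlat ( const std::vector<short>& in, std::vector<short>& out, short level ) {
  assert ( in.size() == out.size() );
  const long n = static_cast<long> ( in.size() );
  #pragma omp parallel for
  for ( long s = 0; s < n; s++ ) {
    out[s] = ( in[s] > level ) ? 1 : 0;
  }
}
```

A likely reason this form outperforms the triple loop is that it makes one linear pass through memory and avoids recomputing a 3-D offset for every voxel, though the actual gain depends on the compiler and the accessor implementation.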
  • 59. Threshold – TBB #1 class Threshold { public: Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...} void operator() ( const tbb::blocked_range<int>& r ) const { for ( int x = r.begin(); x != r.end(); ++x ) { if ( in->mData[x] > 100 ) { out->mData[x] = 1; } else { out->mData[x] = 0; } } } }; ... parallel_for ( tbb::blocked_range<int>(0, in->mVoxelsPerVolume ), Threshold ( in, out ), auto_partitioner() ); // NB: default "grain size" for blocked_range is 1 pixel // tbb::blocked_range<int>(..., in->mVoxelsPerVolume / NumberOfCPUs )
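The note about grain size is easier to picture with the subdivision written out. The sketch below is not TBB code: `splitRange` is a hypothetical helper that halves a range until every piece holds at most `grain` elements, which is the finest subdivision a grain size permits. A real partitioner may stop splitting earlier.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: recursively halve [begin, end) until each piece holds
// at most `grain` elements, collecting the leaf ranges in order.
void splitRange ( int begin, int end, int grain,
                  std::vector<std::pair<int,int> >& leaves ) {
  assert ( grain > 0 && begin <= end );
  if ( end - begin <= grain ) {
    if ( end > begin ) { leaves.push_back ( std::make_pair ( begin, end ) ); }
    return;
  }
  int mid = begin + ( end - begin ) / 2;
  splitRange ( begin, mid, grain, leaves );
  splitRange ( mid, end, grain, leaves );
}
```

With a grain size of 1, a range of N voxels can be cut into as many as N single-voxel tasks, so scheduling overhead can dominate; a grain size near the volume size divided by the CPU count caps the number of pieces at roughly one per CPU.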
  • 60. Threshold – TBB #2 class Threshold { public: Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...} void operator() ( const tbb::blocked_range<int>& r ) const {...} void operator() ( const tbb::blocked_range2d<int,int>& r ) const { for ( int z = 0; z < in->mDepth; z++ ) { for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) { for ( int x = r.cols().begin(); x != r.cols().end(); x++ ){ if ( in->getVoxel(x,y,z) > 100 ) { out->setVoxel(x,y,z,1); } else { out->setVoxel(x,y,z,0); } } } } } }; ... parallel_for ( tbb::blocked_range2d<int,int>( 0, in->mHeight, 32, 0, in->mWidth, 32 ), Threshold ( in, out ), auto_partitioner() );
  • 61. Threshold – TBB #3 class Threshold { public: Threshold ( Image* i, Image* o ) : in ( i ), out ( o ) {...} void operator() ( const tbb::blocked_range<int>& r ) const {...} void operator() ( const tbb::blocked_range2d<int,int>& r ) const {...} void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const { for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) { for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) { for ( int x = r.cols().begin(); x != r.cols().end(); x++ ){ if ( in->getVoxel(x,y,z) > 100 ) { out->setVoxel(x,y,z,1); } else { out->setVoxel(x,y,z,0); } } } } } }; ... parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 1, 0, in->mHeight, 32, 0, in->mWidth, 32 ), Threshold ( in, out ), auto_partitioner() );
  • 62. Threshold – ITK solution void ThreadedGenerateData( const OutputImageRegionType& out, int threadId) { ... // Define the iterators ImageRegionConstIterator<TIn> inputIt(inputPtr, out); ImageRegionIterator<TOut> outputIt(outputPtr, out); inputIt.GoToBegin(); outputIt.GoToBegin(); while( !inputIt.IsAtEnd() ) { if ( inputIt.Get() > 100 ) { outputIt.Set ( 1 ); } else { outputIt.Set ( 0 ); } ++inputIt; ++outputIt; } }
  • 63. Interesting problem – anisotropic diffusion Edge preserving smoothing method Perona and Malik. Scale-space and edge detection using anisotropic diffusion. Pattern Analysis and Machine Intelligence, IEEE Transactions on (1990) vol. 12 (7) pp. 629 – 639 Iterative process Demonstrate OpenMP TBB (ITK has an implementation) (pthreads are tedious at the very least) Pop quiz – are the following correct? 56
  • 64. Anisotropic diffusion – OpenMP void doAD ( Image* in, Image* out ) { #pragma omp parallel for for ( int t = 0; t < TotalTime; t++ ) { for ( int z = 0; z < in->mDepth; z++ ) { ... } } }
  • 65. Anisotropic diffusion – OpenMP void doAD ( Image* in, Image* out ) { short *previousSlice, *slice, *nextSlice; for ( int t = 0; t < TotalTime; t++ ) { #pragma omp parallel for for ( int z = 1; z < in->mDepth-1; z++ ) { previousSlice = in->mSlicePointers[z-1]; slice = in->mSlicePointers[z]; nextSlice = in->mSlicePointers[z+1]; for ( int y = 1; y < in->mHeight-1; y++ ) { short* previousRow = slice + (y-1) * in->mWidth; short* row = slice + y * in->mWidth; short* nextRow = slice + (y+1) * in->mWidth; short* aboveRow = previousSlice + y * in->mWidth; short* belowRow = nextSlice + y * in->mWidth; for ( int x = 1; x < in->mWidth-1; x++ ) { dx = 2 * row[x] - row[x-1] - row[x+1]; dy = 2 * row[x] - previousRow[x] - nextRow[x]; dz = 2 * row[x] - aboveRow[x] - belowRow[x]; ...
  • 66. Anisotropic diffusion – OpenMP void doAD ( Image* in, Image* out ) { for ( int t = 0; t < TotalTime; t++ ) { #pragma omp parallel for for ( int z = 1; z < in->mDepth-1; z++ ) { short* previousSlice = in->mSlicePointers[z-1]; short* slice = in->mSlicePointers[z]; short* nextSlice = in->mSlicePointers[z+1]; for ( int y = 1; y < in->mHeight-1; y++ ) { short* previousRow = slice + (y-1) * in->mWidth; short* row = slice + y * in->mWidth; short* nextRow = slice + (y+1) * in->mWidth; short* aboveRow = previousSlice + y * in->mWidth; short* belowRow = nextSlice + y * in->mWidth; for ( int x = 1; x < in->mWidth-1; x++ ) { dx = 2 * row[x] - row[x-1] - row[x+1]; dy = 2 * row[x] - previousRow[x] - nextRow[x]; dz = 2 * row[x] - aboveRow[x] - belowRow[x]; ...
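The difference between the last two quiz slides is where the slice pointers are declared: inside the parallel loop, each thread gets its own copies. The sketch below makes that runnable under loud assumptions: it uses plain linear diffusion (a 6-neighbour Laplacian) as a placeholder for the elided Perona-Malik update, float voxels instead of short, and hypothetical names `diffuseStep` and `diffuse`. The time loop stays serial and the buffers are swapped between iterations so that no step reads the buffer it is writing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One explicit smoothing step on the interior of a w*h*d volume.
// Placeholder update: plain linear diffusion, NOT the edge-preserving scheme.
void diffuseStep ( const std::vector<float>& in, std::vector<float>& out,
                   int w, int h, int d, float rate ) {
  assert ( in.size() == out.size() );
  assert ( in.size() == static_cast<std::size_t> ( w ) * h * d );
  const int slice = w * h;
  #pragma omp parallel for
  for ( int z = 1; z < d - 1; z++ ) {
    // Declared inside the parallel loop, so every thread has private copies.
    const float* prev = &in[( z - 1 ) * slice];
    const float* cur  = &in[z * slice];
    const float* next = &in[( z + 1 ) * slice];
    float* dst = &out[z * slice];
    for ( int y = 1; y < h - 1; y++ ) {
      for ( int x = 1; x < w - 1; x++ ) {
        const int i = y * w + x;
        const float lap = cur[i - 1] + cur[i + 1] + cur[i - w] + cur[i + w]
                        + prev[i] + next[i] - 6.0f * cur[i];
        dst[i] = cur[i] + rate * lap;
      }
    }
  }
}

// Serial time loop around the parallel step; buffers are swapped each pass.
void diffuse ( std::vector<float>& image, int w, int h, int d, int steps, float rate ) {
  std::vector<float> scratch ( image );  // copy keeps the border voxels intact
  for ( int t = 0; t < steps; t++ ) {
    diffuseStep ( image, scratch, w, h, d, rate );
    image.swap ( scratch );
  }
}
```

Parallelizing the outer time loop instead, as in the first quiz slide, would let iteration t+1 start before iteration t has finished writing, which is why only the slice loop carries the pragma here.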
  • 67. Anisotropic diffusion – TBB #1 class doAD { public: static ADConstants* sConstants; doAD ( Image* in, Image* out ) { ... } void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const { if ( sConstants == NULL ) { initConstants(); } // process ... } };
  • 68. Anisotropic diffusion – TBB #2 class doAD { public: doAD ( ... ) {...} void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const { for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) { for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) { for ( int x = r.cols().begin(); x != r.cols().end(); x++ ){ ... } }; ... parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 0, in->mHeight, 0, in->mWidth ), doAD ( in, out ), auto_partitioner() );
  • 69. Anisotropic diffusion – TBB #3 class doAD { public: static tbb::atomic<int> sProgress; tbb::spin_mutex mMutex; doAD ( ... ) {...} void reportProgress ( int p ) { ... } void operator() ( const tbb::blocked_range3d<int,int,int>& r ) { for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) { tbb::spin_mutex::scoped_lock lock ( mMutex ); sProgress++; reportProgress ( sProgress ); for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) { for ( int x = r.cols().begin(); x != r.cols().end(); x++ ){ ... } }; ... doAD::sProgress = 0; parallel_for (...);
  • 70. Anisotropic diffusion – TBB #4 class doAD { public: static tbb::atomic<int> sProgress; static tbb::spin_mutex mMutex; doAD ( ... ) {...} void reportProgress ( int p ) { ... } void operator() ( const tbb::blocked_range3d<int,int,int>& r ) { for ( int z = r.pages().begin(); z != r.pages().end(); z++ ) { tbb::spin_mutex::scoped_lock lock ( mMutex ); sProgress++; reportProgress ( sProgress ); for ( int y = r.rows().begin(); y != r.rows().end(); y++ ) { for ( int x = r.cols().begin(); x != r.cols().end(); x++ ){ ... } }; ... doAD::sProgress = 0; parallel_for (...);
  • 71. Anisotropic diffusion – OpenMP (Progress) using namespace std; void doAD ( Image* in, Image* out ) { int progress = 0; for ( int t = 0; t < TotalTime; t++ ) { #pragma omp parallel for for ( int s = 0; s < in->mDepth; s++ ) { #pragma omp atomic progress++; #pragma omp single reportProgress ( progress ); ... } } } (slide callout: nowait)
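One way to report progress from inside a parallel loop is to capture the incremented value atomically and serialize only the report. The sketch below is an assumption-laden alternative to the slide, not a reconstruction of it: it uses `atomic capture` plus a `critical` section in place of the slide's `single`, and `processWithProgress` and the `reported` vector are hypothetical stand-ins for the real work and for reportProgress. Compiled without OpenMP, the pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>
#include <vector>

// Process `depth` slices, recording one progress value per finished slice.
// Returns the final counter value.
int processWithProgress ( int depth, std::vector<int>& reported ) {
  assert ( depth >= 0 );
  int progress = 0;
  #pragma omp parallel for
  for ( int s = 0; s < depth; s++ ) {
    // ... per-slice work would go here ...
    int mine;
    // Increment the shared counter and keep this iteration's own value.
    #pragma omp atomic capture
    mine = ++progress;
    // Only the report is serialized; a GUI callback could go here instead.
    #pragma omp critical
    reported.push_back ( mine );
  }
  return progress;
}
```

Reading the shared counter in a separate statement after the increment could report a value that another thread has already advanced, which is why the captured copy is used for reporting.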
  • 72. Real-life problem Compute Frangi’s vesselness measure Frangi et al. Model-based quantitation of 3-D magnetic resonance angiographic images. IEEE Transactions on Medical Imaging (1999) vol. 18 (10) pp. 946-956 Memory constrained solution ITK implementation requires 1.2G for 100M volume Antiga. Generalizing vesselness with respect to dimensionality and shape. Insight Journal (2007) Possible solutions using OpenMP, TBB
  • 74. ITK Implementation – computing the Hessian 6 volumes computed in serial Individual filters are threaded Good CPU usage High memory requirements 67
  • 75. Design considerations Break problem into blocks Compute Hessian, eigenvalues, and vesselness Reduces memory requirements Incurs overhead, boundary conditions
  • 76. Design considerations Keep CPUs full
  • 77. Design considerations – boundary condition 70
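Processing a volume block by block means derivative filters need neighbouring voxels from outside each block, and those neighbours do not exist at the image border. A minimal sketch of the bookkeeping follows, assuming a hypothetical helper `paddedExtent` and a caller-chosen halo width (the slides do not give one); it computes, along one axis, the region to read for a block while clamping to the image.

```cpp
#include <algorithm>
#include <cassert>

// Half-open 1-D extent [begin, end).
struct Extent { int begin; int end; };

// Hypothetical helper: grow a block by `halo` voxels on each side and clamp
// the result to the image, so border blocks simply get a smaller halo.
Extent paddedExtent ( int start, int blockSize, int halo, int imageSize ) {
  assert ( blockSize > 0 && halo >= 0 && start >= 0 && start < imageSize );
  Extent e;
  e.begin = std::max ( 0, start - halo );
  e.end = std::min ( imageSize, start + blockSize + halo );
  return e;
}
```

Applying the helper once per axis gives the 3-D region to read; only the unpadded block is written back. The halo voxels are computed more than once across neighbouring blocks, which is the overhead the design slide refers to.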
  • 79. Algorithm sketch – Serial int BlockSize = 32; for ( int z = 0; z < image->mDepth; z += BlockSize ) { for ( int y = 0; y < image->mHeight; y += BlockSize ) { for ( int x = 0; x < image->mWidth; x += BlockSize ) { processBlock ( in, out, x, y, z, BlockSize ); } } }
  • 80. Algorithm sketch – OpenMP int BlockSize = 32; #pragma omp parallel for for ( int z = 0; z < image->mDepth; z += BlockSize ) { for ( int y = 0; y < image->mHeight; y += BlockSize ) { for ( int x = 0; x < image->mWidth; x += BlockSize ) { processBlock ( in, out, x, y, z, BlockSize ); } } } Each thread is on a different slice May cause cache contention Similar problems for "y" direction
  • 81. Algorithm sketch – OpenMP int BlockSize = 32; for ( int z = 0; z < image->mDepth; z += BlockSize ) { for ( int y = 0; y < image->mHeight; y += BlockSize ) { #pragma omp parallel for for ( int x = 0; x < image->mWidth; x += BlockSize ) { processBlock ( in, out, x, y, z, BlockSize ); } } } All threads on same rows May not utilize all CPUs if the ratio of Width to BlockSize is less than the number of CPUs Better cache utilization
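When the number of blocks along one axis is smaller than the number of CPUs, one option is to number all blocks with a single flat index and parallelize over that. The sketch below assumes hypothetical helpers `blockCount` and `forEachBlock`, and uses a per-voxel visit counter as a stand-in for processBlock. It trades away the cache behaviour discussed on the slide for a larger pool of independent iterations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed to cover `size` voxels (ceiling division).
int blockCount ( int size, int blockSize ) {
  assert ( size >= 0 && blockSize > 0 );
  return ( size + blockSize - 1 ) / blockSize;
}

// Visit every block through one flat index, so the parallel loop sees
// nbx*nby*nbz iterations rather than only the blocks along one axis.
// `visits` counts how often each voxel is touched (stand-in for processBlock).
void forEachBlock ( int w, int h, int d, int blockSize, std::vector<int>& visits ) {
  assert ( visits.size() == static_cast<std::size_t> ( w ) * h * d );
  const int nbx = blockCount ( w, blockSize );
  const int nby = blockCount ( h, blockSize );
  const int nbz = blockCount ( d, blockSize );
  const int total = nbx * nby * nbz;
  #pragma omp parallel for
  for ( int b = 0; b < total; b++ ) {
    // Recover the block origin from the flat block index.
    const int x0 = ( b % nbx ) * blockSize;
    const int y0 = ( ( b / nbx ) % nby ) * blockSize;
    const int z0 = ( b / ( nbx * nby ) ) * blockSize;
    // Blocks on the far edges are clipped to the image size.
    for ( int z = z0; z < std::min ( z0 + blockSize, d ); z++ ) {
      for ( int y = y0; y < std::min ( y0 + blockSize, h ); y++ ) {
        for ( int x = x0; x < std::min ( x0 + blockSize, w ); x++ ) {
          visits[( z * h + y ) * w + x] += 1;  // blocks are disjoint: no race
        }
      }
    }
  }
}
```

For a 512-wide image with BlockSize 32 the innermost loop on the slide offers only 16 iterations per parallel region, whereas the flat index exposes every block in the volume at once.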
  • 82. Algorithm sketch – TBB class Vesselness { public: void operator() ( const tbb::blocked_range3d<int,int,int>& r ) const { // Process the block, could use ITK here processBlock ( r.cols().begin(), r.rows().begin(), r.pages().begin(), r.cols().size(), r.rows().size(), r.pages().size() ); } }; ... parallel_for ( tbb::blocked_range3d<int,int,int>( 0, in->mDepth, 32, 0, in->mHeight, 32, 0, in->mWidth, 32 ), Vesselness( in, out ), auto_partitioner() ); Individual blocks Full CPUs May not have best cache performance
  • 83. Next steps Go try parallel development Try threads to gain understanding and insight Next OpenMP, adapting existing code TBB: more constructs, different approaches Experiment with new languages Erlang, Scala, Reia, Chapel, X10, Fortress... Check out some of the resources provided Have fun! It’s a brave new world out there...
  • 84. Resources TBB (http://www.threadingbuildingblocks.org/) OpenMP (http://openmp.org/wp/) Books/Articles Java Concurrency in Practice (http://www.javaconcurrencyinpractice.com/) Parallel Programming (http://www-users.cs.umn.edu/~karypis/parbook/) ITK Software Guide (http://www.itk.org/ItkSoftwareGuide.pdf) The Problem with Threads (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf) Tutorials Parallel Programming(https://computing.llnl.gov/tutorials/parallel_comp/) pthreads (https://computing.llnl.gov/tutorials/pthreads/) OpenMP (https://computing.llnl.gov/tutorials/openMP/) Other LLNL (https://computing.llnl.gov/) Erlang (http://en.wikipedia.org/wiki/Erlang_programming_language) GCC-OpenMP (http://gcc.gnu.org/projects/gomp/) Intel Compiler (http://software.intel.com/en-us/intel-compilers/) 77
  • 85. Resources Languages Erlang (http://www.erlang.org/) Scala (http://www.scala-lang.org/) Chapel (http://chapel.cs.washington.edu/) X10 (http://x10-lang.org/) Unified Parallel C (http://upc.gwu.edu/) Titanium (http://titanium.cs.berkeley.edu/) Co-Array Fortran (http://www.co-array.org/) ZPL (http://www.cs.washington.edu/research/zpl/home/index.html) High Performance Fortran (http://hpff.rice.edu/) Fortress (http://projectfortress.sun.com/Projects/Community/) Others (http://www.google.com/search?q=parallel+programming+language) 78
  • 87. Thread construction – pthread example #include <pthread.h> void *(*start_routine)(void *); int pthread_create(pthread_t *restrict thread, const pthread_attr_t *restrict attr, void *(*start_routine)(void *), void *restrict arg); void pthread_exit(void *value_ptr); int pthread_join(pthread_t thread, void **value_ptr);
  • 88. Mutex – pthread example #include <pthread.h> pthread_mutex_t myMutex; ... pthread_mutex_init ( &myMutex, NULL ); ... pthread_mutex_lock ( &myMutex ); // Critical Section, only one thread at a time ... pthread_mutex_unlock ( &myMutex ); ... if ( pthread_mutex_trylock ( &myMutex ) == 0 ) { // trylock returns 0 on success (EBUSY if already locked), so we are in the critical section ... pthread_mutex_unlock ( &myMutex ); }
  • 89. Mutex – Java example import java.lang.*; class Foo { public synchronized int doWork () { // only one thread can execute doWork ... } Object resource; public int otherWork () { synchronized ( resource ) { // critical section, resource is the mutex ... } } }
  • 90. Threshold – pthread struct Work { Image* in; Image *out; int start; int end; }; Work workArray[THREADCOUNT]; pthread_t thread[THREADCOUNT]; void* doThreshold ( void* inWork ) { Work* work = (Work*) inWork; for ( int s = work->start; s < work->end; s++ ) {...} return NULL; } ... pthread_attr_t attributes; pthread_attr_init ( &attributes ); pthread_attr_setdetachstate ( &attributes, PTHREAD_CREATE_JOINABLE ); for ( int t = 0; t < THREADCOUNT; t++ ) { initializeWork ( in, out, t, workArray[t] ); pthread_create ( &thread[t], &attributes, doThreshold, (void*) &workArray[t] ); } for ( int t = 0; t < THREADCOUNT; t++ ) { pthread_join ( thread[t], NULL ); }
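The slide does not show how initializeWork fills in start and end. One common choice, sketched here under that assumption, is to hand each thread a contiguous, nearly equal share of the voxels; `WorkRange` and `splitWork` are hypothetical names and not the author's helper.

```cpp
#include <cassert>

// Half-open voxel range [start, end) for one thread.
struct WorkRange { int start; int end; };

// Hypothetical helper: give thread t of `threads` a contiguous share of
// `total` voxels; the first (total % threads) threads get one extra voxel.
WorkRange splitWork ( int total, int threads, int t ) {
  assert ( threads > 0 && t >= 0 && t < threads && total >= 0 );
  const int base = total / threads;
  const int extra = total % threads;
  WorkRange r;
  r.start = t * base + ( t < extra ? t : extra );
  r.end = r.start + base + ( t < extra ? 1 : 0 );
  return r;
}
```

The ranges are disjoint and cover every voxel exactly once, so the threads can write to the output image without a mutex.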
  • 92. Semaphore Allow N threads access Protects limited resources Binary semaphore N = 1 Equivalent to Mutex 85
  • 94. ITK – itk::MultiThreader #include <itkMultiThreader.h> // Win32 DWORD doWork ( LPVOID lpThreadParameter ); // Pthread - Linux, Mac, Unix void* doWork ( void* inWork ); itk::MultiThreader::Pointer threader = itk::MultiThreader::New(); threader->SetNumberOfThreads ( NumberOfThreads ); for ( int i = 0; i < NumberOfThreads; i++ ) { threader->SetMultipleMethod ( i, doWork, (void*) work[i] ); } // Explicit barrier, waits for Thread join threader->MultipleMethodExecute();
  • 95. Insight Toolkit #include <itkImageToImageFilter.h> template <class In, class Out> class Worker : public itk::ImageToImageFilter<In, Out> { ... void BeforeThreadedGenerateData() { // Master thread only ... } void ThreadedGenerateData( const OutputImageRegionType &r, int tid ) { // Generate output data for r ... } void AfterThreadedGenerateData() { // Master thread only ... } }; // Output split on last dimension // i.e. Slices for 3D volumes
  • 96. Anisotropic diffusion – OpenMP using namespace std; void doAD ( Image* in, Image* out ) { for ( int t = 0; t < TotalTime; t++ ) { #pragma omp parallel for for ( int slice = 0; slice < in->mDepth; slice++ ) { ... } } }

Editor's notes

  1. If I had asked this question 5 years ago, almost no one would have raised their hand.
  2. Driving is inherently a parallel task: we coordinate at stop signs and stop lights and obey the rules of the road, but we can still get deadlocked (gridlock).