Generation of Planar Radiographs from 3D Anatomical Models Using the GPU
1. Generation of planar radiographs from 3D
anatomical models using the GPU
André dos Santos Cardoso
Supervisor: Jorge M. G. Barbosa
University of Porto
Faculty of Engineering of the University of Porto
11th February, 2011
André Cardoso andre.cardoso@fe.up.pt DRR Synthesis Algorithms 1/27
2. Contents
Introduction and Context
CUDA Platform
Input Data
Pre-Processing Steps
Developed Algorithms
Conclusion
5. DRRs – Why?
• Shape recovery of human spine
◦ 100s of DRRs per second
• Scoliosis Evaluation
◦ Alternative to MRIs and CTs
6. Project’s Objective
Build Fast DRR Algorithms
• Common bottleneck!
◦ Medical applications demand high throughput
• Take advantage of new GPUs and APIs
◦ Common workstations could do the job!
7. Existing Solution – GLSL
• GLSL implementation – a working multi-pass solution
• Based on depth peeling – Cass Everitt, Interactive Order-Independent Transparency
• Let’s try to enhance its performance!
8. Algorithm Concepts
[Figure: a ray from the X-ray source crosses the object at points P1, P2, P3 and P4 before reaching the image plane. Callout: "Problem! Potential artifact generation!"]
• Each ray traverses the object
◦ Energy is attenuated
$\mathrm{PixelColor} = \exp\big((\lVert P_2 - P_1 \rVert + \lVert P_4 - P_3 \rVert) \times \mathrm{AttenuationFactor}\big)$
• Common edges may lead to artifact generation!
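As a worked example of the attenuation formula above, the following Python sketch evaluates the pixel value for one ray that enters and leaves the object twice. The function name `pixel_color`, the point coordinates and the factor value are illustrative assumptions. A negative factor is assumed so that longer paths give smaller (more attenuated) values, which the slide does not state explicitly.

```python
import math

def pixel_color(points, attenuation_factor):
    """Evaluate exp(path_length * attenuation_factor) for one ray.

    `points` holds the ray/surface intersections in order along the ray
    (P1, P2, P3, P4, ...); consecutive pairs are entry/exit points.
    """
    if len(points) % 2 != 0:
        raise ValueError("expected an even number of intersections")
    path_length = 0.0
    for entry, exit_ in zip(points[0::2], points[1::2]):
        path_length += math.dist(entry, exit_)
    return math.exp(path_length * attenuation_factor)

# Hypothetical ray along the z axis crossing two solid regions.
p1, p2, p3, p4 = (0, 0, 1.0), (0, 0, 3.0), (0, 0, 4.0), (0, 0, 5.0)
value = pixel_color([p1, p2, p3, p4], attenuation_factor=-0.5)
```

Here the ray travels 2 + 1 = 3 units inside the object, so `value` equals exp(-1.5). An odd number of intersections, as can happen when a ray hits an edge shared by two triangles, is rejected here rather than silently producing an artifact.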
10. CUDA Platform
• Compute Unified Device Architecture
◦ Parallel Computing Architecture
◦ Exposes GPU functions and memory
◦ SIMT execution model
◦ Allows hierarchical configuration of threads
• Cheap threads, dozens/hundreds of cores
◦ Thousands of concurrent threads!
• GeForce GT 240
◦ 96 cores
◦ 12288 active threads
11. CUDA Platform – Threading and Memory
13. Inputs for Our Algorithms
• Geometry file – the vertebrae models
16. Inputs for Our Algorithms
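• Camera Calibration Matrix
$$C = \begin{bmatrix} \alpha_u & \lambda & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad P = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}, \qquad K = \begin{bmatrix} R & t \\ 0_3^T & 1 \end{bmatrix}$$
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = C \cdot P \cdot K \cdot \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$
Figure: Pinhole Model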
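To make the pinhole model above concrete, here is a minimal Python sketch that multiplies C, P and K and applies them to one 3D point. All numeric values, and the helper names `matmul` and `project`, are illustrative assumptions and are not taken from the slides.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def project(C, P, K, point):
    """Project a 3D point (X, Y, Z) to pixel coordinates (u, v)."""
    X, Y, Z = point
    homogeneous = [[X], [Y], [Z], [1.0]]
    result = matmul(matmul(matmul(C, P), K), homogeneous)
    su, sv, s = (row[0] for row in result)
    return su / s, sv / s          # divide out the scale factor s

# Illustrative values only.
alpha_u = alpha_v = 2.0            # pixel scale factors
skew, u0, v0 = 0.0, 100.0, 50.0    # lambda and principal point
f = 10.0                           # focal length
C = [[alpha_u, skew, u0], [0.0, alpha_v, v0], [0.0, 0.0, 1.0]]
P = [[f, 0, 0, 0], [0, f, 0, 0], [0, 0, 1, 0]]
# K = [R t; 0 1] with R = identity and t = (0, 0, 5)
K = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 5], [0, 0, 0, 1]]

u, v = project(C, P, K, (1.0, 2.0, 5.0))
```

With these values the point sits at depth 10 in camera coordinates, so the scale factor s is 10 and the result is (u, v) = (102, 54).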
22. Pre-Processing Steps
1. 2D Bounding Box
2. (Projection Source)
3. Ray Direction (for each pixel)
◦ R(t) = O + tD
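The slides do not show how the per-pixel direction D is obtained. One plausible sketch, assuming the image plane is described by an origin point and two step vectors (all hypothetical names), takes D as the normalized vector from the projection source O to the pixel position:

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def ray_for_pixel(source, plane_origin, step_u, step_v, u, v):
    """Return (O, D) for the ray from the source through pixel (u, v)."""
    pixel = tuple(plane_origin[i] + u * step_u[i] + v * step_v[i]
                  for i in range(3))
    direction = normalize(tuple(pixel[i] - source[i] for i in range(3)))
    return source, direction

def point_on_ray(origin, direction, t):
    """Evaluate R(t) = O + tD."""
    return tuple(origin[i] + t * direction[i] for i in range(3))

# Hypothetical setup: source at the origin, image plane at z = 10.
source = (0.0, 0.0, 0.0)
O, D = ray_for_pixel(source, plane_origin=(-1.0, -1.0, 10.0),
                     step_u=(1.0, 0.0, 0.0), step_v=(0.0, 1.0, 0.0),
                     u=1, v=1)
hit = point_on_ray(O, D, 10.0)
```

Because D is normalized, the parameter t in R(t) is a Euclidean distance from the source, which is convenient when distances along the ray are later accumulated.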
25. Image Order Approach
1 Thread for Each Pixel
• Thread ⇐⇒ Ray
• Ray Casting!
• Thread loops over ALL triangles
◦ Tests intersections between ray and triangle
◦ Accumulates distances to source along ray path
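The per-pixel loop described above can be sketched on the CPU as follows. The slides do not name the intersection test, so this sketch assumes the Möller–Trumbore algorithm; the function names and the two-triangle scene are hypothetical. Each call to `path_length` corresponds to the work of one thread.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(origin, direction, tri, eps=1e-9):
    """Moller-Trumbore test: distance t along the ray, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:              # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None

def path_length(origin, direction, triangles):
    """One 'thread': test ALL triangles, then sum exit - entry distances."""
    hits = sorted(t for t in (intersect(origin, direction, tri)
                              for tri in triangles) if t is not None)
    return sum(exit_ - entry for entry, exit_ in zip(hits[0::2], hits[1::2]))

# Hypothetical scene: a front face at z = 2 and a back face at z = 5.
triangles = [((-1, -1, 2), (3, -1, 2), (-1, 3, 2)),
             ((-1, -1, 5), (3, -1, 5), (-1, 3, 5))]
length = path_length((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), triangles)
```

The ray along the z axis enters at distance 2 and leaves at distance 5, so `length` is 3. A ray that grazes an edge shared by two triangles can be reported twice or not at all, which is one way the artifacts mentioned earlier can arise.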
26. Image Order Approach – Problems
1. Many threads looping over many triangles
L3 Vertebra Model
• 776 vertices, 1552 triangles
• PA perspective: 266 × 138 pixels = 36708 threads
• Worst case: 36708 × 1552 ≈ 57 million ray-triangle tests per DRR
28. Image Order Approach – Problems
1. Many threads looping over many triangles
2. Useless intersection tests – heavy operations!
3. Artifacts – hard to take care of!
29. Image Order Approach – Results
• L3 vertebra model
• PA camera – 265 × 137 pixels
• GPU time only!
• Incomplete implementation
SLOW!
30. Object Order Approach
1 Thread for Each Triangle
• Ray Casting!
• Threads spawned for each triangle
◦ Reverse the approach of the former algorithm!
• Thread loops over each pixel covered by the triangle bounding box
◦ Tests intersections between ray and triangle
◦ Accumulates distances to source along ray path
• Concurrency problems!
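A rough CPU-side sketch of the per-triangle loop is given below. It assumes the triangle has already been projected to pixel coordinates and takes the ray-triangle test as a caller-supplied function; `pixel_bounding_box`, `splat_triangle` and `hit_distance` are hypothetical names introduced only for illustration.

```python
import math

def pixel_bounding_box(projected, width, height):
    """Clamp the 2D bounding box of a projected triangle to the image."""
    us = [p[0] for p in projected]
    vs = [p[1] for p in projected]
    u_min = max(0, math.floor(min(us)))
    u_max = min(width - 1, math.ceil(max(us)))
    v_min = max(0, math.floor(min(vs)))
    v_max = min(height - 1, math.ceil(max(vs)))
    return u_min, u_max, v_min, v_max

def splat_triangle(projected, width, height, hit_distance, pixel_hits):
    """Work of one 'thread': visit every pixel in the triangle's box."""
    u_min, u_max, v_min, v_max = pixel_bounding_box(projected, width, height)
    tests = 0
    for v in range(v_min, v_max + 1):
        for u in range(u_min, u_max + 1):
            tests += 1
            t = hit_distance(u, v)   # ray/triangle test for this pixel
            if t is not None:
                # On a GPU, threads of other triangles may write to the
                # same pixel at the same time: the concurrency problem.
                pixel_hits[v][u].append(t)
    return tests

width, height = 8, 6
pixel_hits = [[[] for _ in range(width)] for _ in range(height)]
projected = [(1.2, 1.5), (4.8, 1.5), (1.2, 3.9)]
# Stand-in test: pretend only the ray through pixel (2, 2) hits, at 7.0.
tests = splat_triangle(projected, width, height,
                       lambda u, v: 7.0 if (u, v) == (2, 2) else None,
                       pixel_hits)
```

The bounding box here covers 5 × 4 = 20 pixels, so 20 tests are run for a single hit, which illustrates why many intersection tests remain even after reversing the loop order.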
31. Object Order Approach – Problems
1. Concurrency problems on pixel data.
◦ Fang Liu et al., FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects
[Figure: concurrent threads each execute "int index = atomicInc(sharedCounter);" to claim a slot in the pixel buffer]
33. Object Order Approach – Problems
1. Concurrency problems on pixel data.
2. Still many intersection tests
3. Artifacts still hard to avoid or correct
34. Object Order Approach – Results
• L3 vertebra model
• PA camera – 265 × 137 pixels
• GPU time only!
• Incomplete implementation
SLOW!
35. Multi-depth Approach - Principle
Assume a Simplification
• Discard the Euclidean distance between intersections!
• Consider only distance between Fragments, along depth axis!!
[Figure: rays leaving the source, with intersection points P1, P2 and P’1, P’2 and the depth differences d1 and d2]
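The simplification can be illustrated with a short Python sketch that compares the depth-axis difference with the Euclidean distance for one pair of fragments. The function name and the sample points are assumptions made for illustration.

```python
import math

def thickness_along_depth(depths):
    """Sum (back - front) over consecutive pairs of sorted fragment depths."""
    ordered = sorted(depths)
    return sum(back - front
               for front, back in zip(ordered[0::2], ordered[1::2]))

# Hypothetical entry and exit points of one oblique ray.
p1, p2 = (0.0, 0.0, 2.0), (1.0, 0.0, 4.0)

euclidean = math.dist(p1, p2)                         # sqrt(5), about 2.236
depth_only = thickness_along_depth([p2[2], p1[2]])    # 4 - 2 = 2
```

Only the z values are needed per fragment, so a rasterizer that interpolates depth is enough and no ray-triangle test is required. The price is an underestimate of the path length for oblique rays, as the two numbers above show.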
36. Multi-depth Approach - Pipeline
• Rasterization done using Scanline+Bresenham algorithm
◦ Filling convention avoids artifacts :) !
• Interpolation in Integer interval
◦ $\mathrm{Depth} = \dfrac{Z - Z_{min}}{Z_{max} - Z_{min}} \times \mathrm{INT\_MAX}$
• Saving depth in the pixel array raises concurrency problems (again)!
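The integer depth mapping above can be sketched in Python as follows. The function name is hypothetical, INT_MAX is assumed to be the 32-bit signed maximum, and truncation toward zero is an assumption since the slides do not specify the rounding.

```python
INT_MAX = 2**31 - 1   # assumed 32-bit signed integer maximum

def quantize_depth(z, z_min, z_max):
    """Map z in [z_min, z_max] to an integer in [0, INT_MAX]."""
    return int((z - z_min) / (z_max - z_min) * INT_MAX)

near = quantize_depth(1.0, 1.0, 2.0)   # 0
mid = quantize_depth(1.5, 1.0, 2.0)    # about INT_MAX / 2
far = quantize_depth(2.0, 1.0, 2.0)    # INT_MAX
```

The mapping preserves ordering, so comparing integer depths gives the same result as comparing the original Z values. Integer depths are presumably chosen because CUDA's `atomicMin` operates on integer types.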
37. Multi-depth Approach - Depth Array Ordering
atomicMin inserts each depth in the right place
    initializeDepthArrays(MAX_INTEGER)
    Znew ← interpolateDepth()
    for i = 0 to DEPTH_ARRAY_SIZE − 1 do
        Zold ← atomicMin(&(getPixelDepthArray(u, v, i)), Znew)
        if Zold == MAX_INTEGER then
            break    // the slot was empty, Znew is now stored
        end if
        Znew ← max(Znew, Zold)    // carry the larger (integer) depth to the next slot
    end for
• Fang Liu et al., FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects
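The insertion loop can be traced with a sequential Python emulation. The `atomic_min` helper below is a single-threaded stand-in that mimics the return-old-value behaviour of CUDA's `atomicMin`; it does not demonstrate the concurrent case, and all names are illustrative.

```python
MAX_INTEGER = 2**31 - 1

def atomic_min(array, index, value):
    """Sequential stand-in for atomicMin: store the min, return the old value."""
    old = array[index]
    array[index] = min(old, value)
    return old

def insert_depth(depth_array, z_new):
    """Insert z_new so that depth_array stays sorted in ascending order."""
    for i in range(len(depth_array)):
        z_old = atomic_min(depth_array, i, z_new)
        if z_old == MAX_INTEGER:     # slot was empty, value is stored
            break
        z_new = max(z_new, z_old)    # push the larger value further down

depth_array = [MAX_INTEGER] * 4
for z in (5, 2, 9, 1):
    insert_depth(depth_array, z)
# depth_array is now [1, 2, 5, 9]
```

Each slot always keeps the smaller of its current value and the incoming one, and the larger value moves on to the next slot, so the per-pixel array is sorted as soon as the last fragment is written. If more fragments arrive than there are slots, the largest depths are dropped.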
38. Multi-depth Approach - Results
• Best time:
◦ 202 × 132 pixels
◦ GPU + CPU time!
◦ Performance with and without DRR transfer to host!
BETTER!
39. Multi-depth Optimization
• Multi-depth allows for an ordered set of depths
◦ More depths =⇒ more atomicMin() calls
We can postpone depth ordering...
    index ← atomicInc(&counter, INT_MAX)
    depthArray[index] ← Znew    // RAW-hazard free!
• depthArray has all the depth values;
◦ Ordering can be done in a post-processing kernel!
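The append-then-sort idea can be sketched sequentially in Python. The `PixelBuffer` class and its methods are hypothetical. `atomic_inc` only imitates the return-old-value behaviour of CUDA's `atomicInc` and does not model its wrap-around, and dropping fragments beyond the capacity is an assumption.

```python
class PixelBuffer:
    """Per-pixel fragment list filled in arrival order, sorted afterwards."""

    def __init__(self, capacity):
        self.counter = 0
        self.depths = [None] * capacity

    def atomic_inc(self):
        """Sequential stand-in for atomicInc: return the old counter value."""
        old = self.counter
        self.counter += 1
        return old

    def append(self, z_new):
        index = self.atomic_inc()        # each caller gets a unique slot
        if index < len(self.depths):     # assumption: drop overflow fragments
            self.depths[index] = z_new

    def sorted_depths(self):
        """Post-processing step: order only the slots that were filled."""
        filled = min(self.counter, len(self.depths))
        return sorted(self.depths[:filled])

buf = PixelBuffer(capacity=4)
for z in (5, 2, 9, 1):
    buf.append(z)
ordered = buf.sorted_depths()
```

Each fragment costs a single atomic operation instead of one per occupied slot, and because every writer receives a distinct index, no two writers touch the same element. The sort is deferred to a separate pass over each pixel's short list.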
40. Multi-depth Optimization
[Figure: concurrent threads each execute "int index = atomicInc(sharedCounter);" to claim a slot in the pixel buffer]
42. Multi-depth Optimization – Results
Better than Current Solution
44. Conclusion
• CUDA implementations for DRR extraction
◦ Both pre-processing and main computation tasks
◦ Artifact-free
• Single geometry pass
• Shared memory model
◦ May be adapted to other technologies
• Final implementation shows better performance than GLSL
45. Future Work
There’s a Big Chart to Fill Up...
46. Future Work
• Still some artifacts
• Memory operation optimizations
• Comparisons with other implementations and other geometry models
• Build a DRR generation library
◦ possibly an open-source project
• Participation in IJUP’11
• Paper preparation for VIPIMAGE 2011. Abstract deadline: 15th March.
47. Thank You for Listening!
Ask Away!