Improving the performance of OpenSubdiv* on Intel Architecture

Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Sheng Fu
(sheng.fu@intel.com)
August 12, 2015

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO
SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION
CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS
COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH
MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves
these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this
information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar
performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other
platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
SPEC, SPECint, SPECfp, SPECrate. SPECpower, SPECjAppServer, SPECjbb, SPECjvm, SPECWeb, SPECompM, SPECompL, SPEC MPI, SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See
http://www.spec.org for more information. TPC-C, TPC-H, TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.
Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and
software you use. For more information including details on which processors support HT Technology, see here
Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration.
Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
No computer system can provide absolute security. Requires an enabled Intel® processor and software optimized for use of the technology. Consult your system manufacturer and/or software vendor for more information.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to:
Learn About Intel® Processor Numbers
Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.
Copyright © 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Intel Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and
products specified are for planning purposes only and are subject to change without notice
*Other names and brands may be claimed as the property of others.
Legal Disclaimers

The above statements and any others in this document that refer to plans and expectations for the third quarter, the year and the future are forward-looking statements that
involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations
identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many
factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those
expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could cause actual results to differ materially from the
company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; customer acceptance of
Intel’s and competitors’ products; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes
in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to
negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by
a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross
margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors,
including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to
technological developments and to incorporate new features into its products. The gross margin percentage could vary significantly from expectations based on capacity
utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the
timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of
materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. Intel's
results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including
military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain
marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of
revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel's results could be affected by adverse effects associated with
product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust,
disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an
injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or
requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in
Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release.
Risk Factors

Agenda
• Introduction to OpenSubdiv*
• Optimizing the subdivision kernel with ICC
• Optimizing patch evaluation with ISPC
• Embree Viewer: a demo that renders animated subdivision surfaces interactively on Intel architecture

What is a subdivision surface?
• Start from a polygon control mesh
• Apply the subdivision rule recursively to get the limit surface

Why have subdivision surfaces been extensively used in the DCC industry?
• Support arbitrary topology
• Smooth
• Deform efficiently for animation

What is OpenSubdiv*?
• Open source libraries that implement high-performance subdivision surface evaluation on CPU and GPU
• Optimized for drawing deforming surfaces with static topology at interactive frame rates
• Matches the RenderMan* specification

Pipeline to render subdivision surfaces
• Control mesh: feature-adaptive subdivision to get patches
• Patches: evaluate patches to tessellate them into triangles
• Triangles: render the triangle meshes

Optimizing a subdivision kernel
• How a subdivision kernel works: compute the vertex data for a vertex in the new level by summing the weighted vertex data of the surrounding vertices in the current level
• Example with four surrounding vertices (v1, w1), (v2, w2), (v3, w3), (v4, w4):
  vnew = v1*w1 + v2*w2 + v3*w3 + v4*w4

for (int i = start; i < end; ++i) {
    for (int k = 0; k < numElems; ++k)
        result[k] = 0.0f;
    for (int j = 0; j < sizes[i]; ++j, ++indices, ++weights) {
        src = vertexSrc + (*indices) * numElems;
        weight = *weights;
        for (int k = 0; k < numElems; ++k) {
            result[k] += src[k] * weight;
        }
    }
    dst = vertexDst + i * numElems;
    memcpy(dst, result, numElems * sizeof(float));
}
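
The kernel above can be tried in isolation with a small self-contained sketch. The StencilTable struct and the function names are hypothetical, introduced only for this illustration (they are not the OpenSubdiv API), and for brevity the sketch accumulates straight into the destination buffer instead of a local result array followed by memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the stencil tables consumed by the kernel.
struct StencilTable {
    std::vector<int>   sizes;    // number of source vertices per new vertex
    std::vector<int>   indices;  // concatenated source-vertex indices
    std::vector<float> weights;  // concatenated weights, parallel to indices
};

// Computes dst[i] = sum_j weights[j] * src[indices[j]] for every stencil i.
// numElems is the number of floats per vertex (3 for x, y, z).
std::vector<float> applyStencils(const StencilTable& table,
                                 const std::vector<float>& vertexSrc,
                                 int numElems) {
    assert(table.indices.size() == table.weights.size());
    std::vector<float> vertexDst(table.sizes.size() * numElems, 0.0f);
    std::size_t cursor = 0;  // current position in indices/weights
    for (std::size_t i = 0; i < table.sizes.size(); ++i) {
        float* dst = &vertexDst[i * numElems];
        for (int j = 0; j < table.sizes[i]; ++j, ++cursor) {
            const float* src = &vertexSrc[table.indices[cursor] * numElems];
            const float weight = table.weights[cursor];
            for (int k = 0; k < numElems; ++k)
                dst[k] += src[k] * weight;
        }
    }
    return vertexDst;
}

// Worked example: the four corners of a unit quad, one stencil with
// weight 0.25 per corner, which yields the face centroid.
std::vector<float> quadCentroidExample() {
    const std::vector<float> src = {0, 0, 0,  1, 0, 0,  1, 1, 0,  0, 1, 0};
    const StencilTable table = {{4}, {0, 1, 2, 3},
                                {0.25f, 0.25f, 0.25f, 0.25f}};
    return applyStencils(table, src, 3);
}
```

Calling quadCentroidExample() returns the three floats of the new vertex, here the centroid (0.5, 0.5, 0) of the quad.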

Vectorizing a subdivision kernel with ICC auto-vectorization
• What is vectorization?
  • Converts scalar code to SIMD code
• What is ICC auto-vectorization?
  • ICC automatically identifies vectorizable loops and generates packed SIMD instructions that process several loop iterations at once
  • Only the innermost loop can be auto-vectorized
  • Use pragmas to help ICC vectorize the loop: #pragma ivdep, #pragma simd, #pragma vector aligned, …

Optimizing a subdivision kernel with ICC
for (int i = start; i < end; ++i) {
    for (int k = 0; k < numElems; ++k)
        result[k] = 0.0f;
    for (int j = 0; j < sizes[i]; ++j, ++indices, ++weights) {
        src = vertexSrc + (*indices) * numElems;
        weight = *weights;
        // This loop got vectorized
        #pragma simd
        #pragma vector aligned
        for (int k = 0; k < numElems; ++k) {
            result[k] += src[k] * weight;
        }
    }
    dst = vertexDst + i * numElems;
    memcpy(dst, result, numElems * sizeof(float));
}
• The innermost loop, marked with the pragmas, is the one that got vectorized
• The sum is accumulated in a local variable (result) to avoid an extra memory copy

• Align vertex data to get better performance
  • Vertex data must be aligned on 4 floats or 8 floats
• The subdivision kernel uses a template to remove the cost of the loop
  • When numElems is a compile-time constant of 4 or 8, the highlighted loop can be converted to two SIMD instructions (a multiply and an add), or one FMA instruction

template <int numElems> void
ComputeStencilKernel(
    ……….
        // highlighted loop
        #pragma simd
        #pragma vector aligned
        for (int k = 0; k < numElems; ++k) {
            result[k] += src[k] * weight;
        }
    ………..
}
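
The effect of a compile-time element count can be sketched in portable C++. This is a minimal illustration, not the OpenSubdiv kernel: accumulate, isAligned and blendExample are hypothetical names, and the ICC-specific pragmas are left out so that the sketch builds with any compiler.

```cpp
#include <cassert>
#include <cstdint>

// One weighted accumulation step with the element count fixed at compile
// time. With NumElems = 4 or 8 the loop has a constant trip count, so the
// compiler is free to unroll it fully and map it onto SIMD registers.
template <int NumElems>
inline void accumulate(float* result, const float* src, float weight) {
    for (int k = 0; k < NumElems; ++k)
        result[k] += src[k] * weight;
}

// Hypothetical helper: true when p sits on a NumElems-float boundary
// (16 bytes for 4 floats, 32 bytes for 8 floats).
template <int NumElems>
inline bool isAligned(const float* p) {
    return reinterpret_cast<std::uintptr_t>(p) %
               (NumElems * sizeof(float)) == 0;
}

// Worked example: blend two 4-float vertices with weights 0.5 and 0.5
// and return one component of the result.
inline float blendExample(int component) {
    alignas(16) const float v1[4] = {0.0f, 2.0f, 4.0f, 1.0f};
    alignas(16) const float v2[4] = {2.0f, 2.0f, 0.0f, 1.0f};
    alignas(16) float result[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    assert(isAligned<4>(v1) && isAligned<4>(v2) && isAligned<4>(result));
    accumulate<4>(result, v1, 0.5f);
    accumulate<4>(result, v2, 0.5f);
    return result[component];
}
```

The blended vertex is (1, 2, 2, 1). Padding xyz data to four floats, as done here, is one way to meet the 4-float alignment requirement, at the cost of one unused float per vertex.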

Subdivision kernel time, collected from glViewer (CPU kernel, subd level = 2, tessellation level = 1)
Data collected on a two-socket, 20-core Ivy Bridge system

GCC        1.6 ms     0.5 ms     0.9 ms
ICC        0.46 ms    0.15 ms    0.24 ms
Speedup    3.5x       3.3x       3.8x

Parallelize a subdivision kernel with TBB
• TBB is an open-source, task-based parallel programming library
• The OpenSubdiv* TBB kernel uses TBB parallel_for to parallelize the subdivision kernel
• TBB parallel_for can also be used at the subd mesh level to achieve better load balancing

Pseudo code for nested parallel_for:
Run tbb parallel_for on an array of subd meshes
{
    Run tbb parallel_for on an array of subdivision kernels
    {
    }
}

Parallelize a subdivision kernel with TBB
VTune Amplifier thread profiling (collected on a 20-core Ivy Bridge system)
[Figure: CPU utilization for nested parallel_for]
[Figure: CPU utilization for parallel_for only on the subdivision kernel]
Performance result:
• Total number of meshes: 222
• Minimum control face number: 28
• Maximum control face number: 60,038
• Wall clock time for “parallel for only on subdivision level”: 5 ms
• Wall clock time for “nested parallel for”: 2.6 ms (2x speedup)

Optimizing patch evaluation with ISPC
[Figure: subdivision surface render pipeline (control faces, feature-adaptive subdivision, patches, evaluate patches to triangles, triangles)]

Step1: bundle sample points for the same patch
Benefits of bundling sample points:
• Vertex data for a patch only needs to be gathered once
• The samples are ready for evaluating the patch with SIMD
The data layout in the patch coordinate buffer for bundled sample points:

PatchCoord                    Array Index   Patch Index   Vertex Index   S    T
One patch, sample 1           1             1             1              S1   T1
One patch, sample 2           1             1             1              S2   T2
Another patch                 n             n             n              S    T
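
The bundling step can be sketched as a sort by patch followed by a scan for runs that share a patch. The Sample struct below is a simplified, hypothetical stand-in for the PatchCoord record (it keeps only a patch index and the S/T coordinates), and the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified sample record: a patch id plus (s, t) on it.
struct Sample {
    int   patchIndex;
    float s, t;
};

// Sorts samples by patch and returns [begin, end) index ranges, one per
// patch, so each range can reuse a single gather of the control points.
std::vector<std::pair<std::size_t, std::size_t>>
bundleByPatch(std::vector<Sample>& samples) {
    std::stable_sort(samples.begin(), samples.end(),
                     [](const Sample& a, const Sample& b) {
                         return a.patchIndex < b.patchIndex;
                     });
    std::vector<std::pair<std::size_t, std::size_t>> bundles;
    std::size_t begin = 0;
    while (begin < samples.size()) {
        std::size_t end = begin + 1;
        while (end < samples.size() &&
               samples[end].patchIndex == samples[begin].patchIndex)
            ++end;
        bundles.emplace_back(begin, end);
        begin = end;
    }
    return bundles;
}

// Worked example: five samples spread over patches 7, 3, 7, 3, 7.
std::vector<std::pair<std::size_t, std::size_t>> bundleExample() {
    std::vector<Sample> samples = {
        {7, 0.1f, 0.1f}, {3, 0.5f, 0.5f}, {7, 0.9f, 0.1f},
        {3, 0.2f, 0.8f}, {7, 0.4f, 0.6f}};
    return bundleByPatch(samples);
}
```

In the example the two samples on patch 3 end up in the range [0, 2) and the three samples on patch 7 in [2, 5), so the control points of each patch are needed only once per range.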

Step2: Evaluate a patch with ISPC
What is the Intel SPMD (single program, multiple data) Program Compiler (ISPC)?
• An open-source language and compiler for Intel SIMD architectures
• ISPC is NOT an “autovectorizing” compiler!
  • Unlike ICC, it does not generate vector code by analyzing and transforming scalar loops
• ISPC is more of a “WYSIWYG” vectorizing compiler
  • The programmer tells ISPC what is vector and what is scalar
  • Vector types are explicit, not discovered

Step2: Evaluate a patch with ISPC
ISPC Language
• Familiar C-based syntax
• Code like sequential algorithms, but executes in parallel (SPMD)
• Easily mixes scalar and vector computation
• Two new type modifiers (uniform and varying) distinguish between scalar
and vector data types
• Easily interoperates with C/C++
• You can call C/C++ code from ISPC functions, or call ISPC code from C/C++ code
• Passing pointers between ISPC and C/C++ code just works
• Efficient data layout

ISPC Example – an ISPC function
export void simple(uniform float vin[], uniform float vout[],
                   uniform int count) {
    foreach (index = 0 ... count) {
        varying float v = vin[index];
        if (v < 3.)
            v = v * v;
        else
            v = sqrt(v);
        vout[index] = v;
    }
}
• export: makes the function visible from C
• uniform: scalar input type
• foreach: provides automatic multi-dimensional traversal of the iteration space, optimal code generation for fully-vectorized iterations, and automatic remainder loop generation
• varying float v: vector type
• index: varying (default) loop index, so a “vector-width” number of iterations is done at once (depending on the compile target), one loop iteration per vector “lane”, with masking
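
For comparison, a scalar C++ reference of the same function helps show what each SIMD lane computes. This is only an illustrative sketch: simpleScalar and simpleScalarOne are hypothetical names, and the loop runs one element at a time where ISPC would process a vector-width group of elements per iteration, masking off the lanes that take the other branch.

```cpp
#include <cassert>
#include <cmath>

// Scalar reference for the ISPC function above: one loop iteration here
// corresponds to the work done by one SIMD lane in the ISPC version.
void simpleScalar(const float vin[], float vout[], int count) {
    assert(count >= 0);
    for (int index = 0; index < count; ++index) {
        float v = vin[index];
        if (v < 3.0f)
            v = v * v;         // lanes where the condition holds
        else
            v = std::sqrt(v);  // remaining lanes
        vout[index] = v;
    }
}

// Hypothetical helper that runs the reference on a single value.
float simpleScalarOne(float x) {
    float out = 0.0f;
    simpleScalar(&x, &out, 1);
    return out;
}
```

Inputs below 3 are squared and all others are replaced by their square root, so 2 maps to 4 and 3 maps to about 1.732.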

ISPC Example – C code that calls it
#include <stdio.h>
#include "simple.h"
int main() {
    float vin[16], vout[16];
    for (int i = 0; i < 16; ++i)
        vin[i] = i;
    simple(vin, vout, 16);  // call the ISPC function
    for (int i = 0; i < 16; ++i)
        printf("%d: simple(%f) = %f\n", i, vin[i], vout[i]);
}
Output:
0: simple(0.000000) = 0.000000
1: simple(1.000000) = 1.000000
2: simple(2.000000) = 4.000000
3: simple(3.000000) = 1.732051
...

ISPC patch evaluation: gathering control points
uniform Point controlVertices[16];
for (uniform int i = 0; i < 16; i++) {
    uniform unsigned int id = vertexIndices[i];
    uniform const float * uniform pVertex;
    pVertex = inQ + inDesc.offset + id * inDesc.stride;
    controlVertices[i].x = pVertex[0];
    controlVertices[i].y = pVertex[1];
    controlVertices[i].z = pVertex[2];
    pVertex += 3;
}
• Gathering only needs to be done once for each patch, since sampling points are sorted by patch handle
• Data are uniform, no SIMD yet

ISPC patch evaluation: vectorized patch evaluation
foreach (n = 0 ... nPoint) {
    float sWeights[4], tWeights[4];
    getBSplineWeights(s, sWeights);
    getBSplineWeights(t, tWeights);
    float weight[16];
    for (uniform int i = 0; i < 4; ++i)
        for (uniform int j = 0; j < 4; ++j) {
            weight[4*i+j] = sWeights[j] * tWeights[i];
        }
    float *pOutQ = outQ + outDesc.offset + n * outDesc.stride;
    for (uniform int c = 0; c < nChannel; c++) {
        uniform int offset = c * 16;
        Point Q;
        Q.x = Q.y = Q.z = 0.0;
        for (uniform int i = 0; i < 16; ++i) {
            Q = Q + weight[i] * controlVertices[offset + i];
        }
        *pOutQ++ = Q.x, *pOutQ++ = Q.y, *pOutQ++ = Q.z;
    }
}

inline void
getBSplineWeights(float t, float point[4]) {
    float const one6th = 1.0f / 6.0f;
    float t2 = t * t;
    float t3 = t * t2;
    point[0] = one6th * (1.0f - 3.0f*(t - t2) - t3);
    point[1] = one6th * (4.0f - 6.0f*t2 + 3.0f*t3);
    point[2] = one6th * (1.0f + 3.0f*(t + t2 - t3));
    point[3] = one6th * (t3);
}
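
The weight formulas can be checked with a scalar C++ sketch of the same evaluation. It assumes a single scalar channel and a row-major layout of the 16 control values. The names bsplineWeights, evalPatchChannel and linearPatchExample are hypothetical and exist only for this illustration.

```cpp
#include <cassert>

// Uniform cubic B-spline basis weights at parameter t in [0, 1]
// (same formulas as getBSplineWeights above).
inline void bsplineWeights(float t, float w[4]) {
    const float one6th = 1.0f / 6.0f;
    const float t2 = t * t;
    const float t3 = t * t2;
    w[0] = one6th * (1.0f - 3.0f * (t - t2) - t3);
    w[1] = one6th * (4.0f - 6.0f * t2 + 3.0f * t3);
    w[2] = one6th * (1.0f + 3.0f * (t + t2 - t3));
    w[3] = one6th * t3;
}

// Evaluates one scalar channel of a bicubic B-spline patch at (s, t).
// cv holds 16 control values in row-major order: cv[4 * row + column].
inline float evalPatchChannel(const float cv[16], float s, float t) {
    assert(s >= 0.0f && s <= 1.0f && t >= 0.0f && t <= 1.0f);
    float sW[4], tW[4];
    bsplineWeights(s, sW);
    bsplineWeights(t, tW);
    float q = 0.0f;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            q += sW[j] * tW[i] * cv[4 * i + j];
    return q;
}

// Worked example: control values equal to their column index (0, 1, 2, 3).
// A cubic B-spline reproduces linear data, so the result is 1 + s.
inline float linearPatchExample(float s, float t) {
    float cv[16];
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            cv[4 * i + j] = static_cast<float>(j);
    return evalPatchChannel(cv, s, t);
}
```

Two properties make handy sanity checks: the four weights sum to 1 for any t, and at t = 0 they are 1/6, 4/6, 1/6 and 0.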

Parallelize ISPC patch evaluation with TBB
// Run tbb parallel_for on all sampling points
tbb::blocked_range<int> range = tbb::blocked_range<int>(0, numPatchCoords, grain_size);
tbb::parallel_for(range, [&](const tbb::blocked_range<int> &r)
{
    int i = r.begin();
    while (i < r.end()) {
        // Search for sampling points that belong to the same patch
        int nCoord = 1;
        Far::PatchTable::PatchHandle handle = patchCoords[i].handle;
        while (i + nCoord < r.end() && handle.isEqual(patchCoords[i + nCoord].handle))
            nCoord++;
        // Put UV into local arrays
        __declspec( align(64) ) float u[nCoord], v[nCoord];
        for (int n = 0; n < nCoord; n++) {
            u[n] = patchCoords[i + n].s;
            v[n] = patchCoords[i + n].t;
        }
        // Call the ISPC evaluation function
        ispc::evalPatch(nCoord, u, v, …);
        i += nCoord;
    }
});

ISPC patch evaluation performance data

Sampling    ISPC                                   TBB
points      Single thread (ms)   20 threads (ms)   Single thread (ms)   20 threads (ms)
65536       2.5 (3.6x)           0.5 (1.25x)       7.1                  0.6
655360      12 (5.8x)            3.0 (2.1x)        70                   6.3

Performance data collected in glLimitEval, subdivision level = 3, vertex animation turned off, CPU: two-socket, 20-core Ivy Bridge

Demo: Embree Viewer
[Figure: subdivision surface render pipeline (control faces, feature-adaptive subdivision, patches, evaluate patches to triangles, triangles)]

Demo: Embree Viewer
• A demo similar to glViewer
• A complete CPU-based solution: subdivision, tessellation, and rendering all run on the CPU
• Ray tracing-based rendering with Embree, an open source collection of ray tracing kernels
• High-quality rendering that supports shadows

Demo: Embree Viewer
• Step1: feature-adaptive subdivision to generate patches with the TBB subdivision kernel
[Figure: patches generated with subdivision level = 1, labeled patch1 to patch7]

Demo: Embree Viewer
• Step2: uniformly tessellate each patch into triangles, using ispcEvaluator to evaluate tessellation points on the limit surface
[Figure: patches tessellated with tessellation level = 1]

Demo: Embree Viewer
Step3: render mesh with Embree
• Create an Embree scene: rtcNewScene
• Create an Embree triangle mesh: rtcNewTriangleMesh
• Pass the vertex buffer and index buffer representing the
triangle mesh to Embree: rtcSetBuffer
• Update BVH when mesh positions are updated: rtcUpdate
• Build Embree BVH: rtcCommit

Demo: Embree Viewer
Step3: render mesh with Embree
• Divide the screen into 8x8 tiles, and use TBB parallel_for to render the tiles in parallel
• Fire 8 packed primary rays and test for intersections using SIMD: rtcIntersect8
• Fire a shadow ray for each intersected ray: rtcOccluded
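
The tile decomposition can be sketched as follows. The sketch assumes that “8x8 tiles” means tiles of 8x8 pixels, which the slide does not state explicitly. Tile, makeTiles and packetsPerTile are hypothetical names, and no Embree or TBB calls are made here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tile record: pixel rectangle [x0, x1) x [y0, y1).
struct Tile { int x0, y0, x1, y1; };

// Splits a width x height image into tiles of at most tileSize x tileSize
// pixels; tiles on the right and bottom borders are clipped to the image.
std::vector<Tile> makeTiles(int width, int height, int tileSize) {
    assert(width > 0 && height > 0 && tileSize > 0);
    std::vector<Tile> tiles;
    for (int y = 0; y < height; y += tileSize)
        for (int x = 0; x < width; x += tileSize)
            tiles.push_back({x, y,
                             x + tileSize < width ? x + tileSize : width,
                             y + tileSize < height ? y + tileSize : height});
    return tiles;
}

// Number of ray packets of packetWidth rays needed to cover one tile,
// rounding up when the pixel count is not a multiple of packetWidth.
int packetsPerTile(const Tile& t, int packetWidth) {
    const int pixels = (t.x1 - t.x0) * (t.y1 - t.y0);
    return (pixels + packetWidth - 1) / packetWidth;
}
```

Under that assumption, an 800x800 image yields 10,000 independent tiles for parallel_for to distribute, and each full tile of 64 pixels needs eight packets of 8 primary rays.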

Demo: Embree Viewer
Model: toy car
Shadow: off
Patches: 11,331
Triangle: 201,600
Vertices: 171,283
Subd level: 2
Tess level: 1
FPS: 72
Resolution: 800x800
CPU: two socket 20 core IvyBridge

Demo: Embree Viewer
Model: toy car
Shadow: on
Patches: 11,331
Triangle: 201,600
Vertices: 171,283
Subd level: 2
Tess level: 1
FPS: 45
Resolution: 800x800

Demo: Embree Viewer
Model: toy car
Patches: 11,331
Triangle: 201,600
Vertices: 171,283
Subd level: 2
Tess level: 1
Resolution: 800x800

Links for tools and libraries mentioned in this presentation
The optimized OpenSubdiv is in the following fork:
https://github.com/shengfuintel/OpenSubdiv (check out the intel branch)
Intel tools and libraries mentioned in this presentation:
• ICC: https://software.intel.com/en-us/c-compilers
• ISPC: https://ispc.github.io
• Embree: https://embree.github.io
• VTune Amplifier: https://software.intel.com/en-us/intel-vtune-amplifier-xe
• TBB: https://www.threadingbuildingblocks.org/
