DreamWorks Animation

Copyright © 2015, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
DreamWorks Animation*:
Slashing the cost of 3d Matrix
Math using X-Form
(Transform) Building Blocks


Alex Wells (presenter)
& Martin Watt (DWA)
August 12 & 13, 2015

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO
SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION
CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS
COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH
MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves
these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this
information.
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated
purchases, including the performance of that product when combined with other products.
Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar
performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other
platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
SPEC, SPECint, SPECfp, SPECrate. SPECpower, SPECjAppServer, SPECjbb, SPECjvm, SPECWeb, SPECompM, SPECompL, SPEC MPI, SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See
http://www.spec.org for more information. TPC-C, TPC-H, TPC-E are trademarks of the Transaction Processing Council. See http://www.tpc.org for more information.
Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and
software you use. For more information including details on which processors support HT Technology, see here
Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration.
Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
No computer system can provide absolute security. Requires an enabled Intel® processor and software optimized for use of the technology. Consult your system manufacturer and/or software vendor for more information.
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families: Go to:
Learn About Intel® Processor Numbers
Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel’s current plan of record product roadmaps.
Copyright © 2014 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Intel Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and
products specified are for planning purposes only and are subject to change without notice
*Other names and brands may be claimed as the property of others.
Legal Disclaimers

The above statements and any others in this document that refer to plans and expectations for the third quarter, the year and the future are forward-looking statements that
involve a number of risks and uncertainties. Words such as “anticipates,” “expects,” “intends,” “plans,” “believes,” “seeks,” “estimates,” “may,” “will,” “should” and their variations
identify forward-looking statements. Statements that refer to or are based on projections, uncertain events or assumptions also identify forward-looking statements. Many
factors could affect Intel’s actual results, and variances from Intel’s current expectations regarding such factors could cause actual results to differ materially from those
expressed in these forward-looking statements. Intel presently considers the following to be the important factors that could cause actual results to differ materially from the
company’s expectations. Demand could be different from Intel's expectations due to factors including changes in business and economic conditions; customer acceptance of
Intel’s and competitors’ products; supply constraints and other disruptions affecting customers; changes in customer order patterns including order cancellations; and changes
in the level of inventory at customers. Uncertainty in global economic and financial conditions poses a risk that consumers and businesses may defer purchases in response to
negative financial events, which could negatively affect product demand and other related matters. Intel operates in intensely competitive industries that are characterized by
a high percentage of costs that are fixed or difficult to reduce in the short term and product demand that is highly variable and difficult to forecast. Revenue and the gross
margin percentage are affected by the timing of Intel product introductions and the demand for and market acceptance of Intel's products; actions taken by Intel's competitors,
including product offerings and introductions, marketing programs and pricing pressures and Intel’s response to such actions; and Intel’s ability to respond quickly to
technological developments and to incorporate new features into its products. The gross margin percentage could vary significantly from expectations based on capacity
utilization; variations in inventory valuation, including variations related to the timing of qualifying products for sale; changes in revenue levels; segment product mix; the
timing and execution of the manufacturing ramp and associated costs; start-up costs; excess or obsolete inventory; changes in unit costs; defects or disruptions in the supply of
materials or resources; product manufacturing quality/yields; and impairments of long-lived assets, including manufacturing, assembly/test and intangible assets. Intel's
results could be affected by adverse economic, social, political and physical/infrastructure conditions in countries where Intel, its customers or its suppliers operate, including
military conflict and other security risks, natural disasters, infrastructure disruptions, health concerns and fluctuations in currency exchange rates. Expenses, particularly certain
marketing and compensation expenses, as well as restructuring and asset impairment charges, vary depending on the level of demand for Intel's products and the level of
revenue and profits. Intel’s results could be affected by the timing of closing of acquisitions and divestitures. Intel's results could be affected by adverse effects associated with
product defects and errata (deviations from published specifications), and by litigation or regulatory matters involving intellectual property, stockholder, consumer, antitrust,
disclosure and other issues, such as the litigation and regulatory matters described in Intel's SEC reports. An unfavorable ruling could include monetary damages or an
injunction prohibiting Intel from manufacturing or selling one or more products, precluding particular business practices, impacting Intel’s ability to design its products, or
requiring other remedies such as compulsory licensing of intellectual property. A detailed discussion of these and other factors that could affect Intel’s results is included in
Intel’s SEC filings, including the company’s most recent reports on Form 10-Q, Form 10-K and earnings release.
Risk Factors

DWA* Character Animation: Speedup After XBB
[Figure: Before vs. After comparison. Motion System speedup 1.6x; overall speedup 1.2x]

 Motion System in DWA Character Animation
 Observed performance bottlenecks in Motion System
 3d Matrix transforms
 How would an ideal transform behave
 XBB representation
 XBB deferred evaluation
 Results
Agenda

 To represent the bones of a skeleton in 3d space, an
animation tool builds a Hierarchy of Joints that records how
they are connected.
– Typically a Directed Acyclic Graph of Joints
How is a skeleton represented for
animation?

 Relative to a parent Joint (in Local Space), each Joint
needs to model:
– Rotation as Euler Angles (around the X, Y, and Z axes) & rotation Order
– Scale (of the X, Y, and Z axes)
– Shear (along the X, Y, and Z axes)
– Translation (X, Y, and Z components)
 Animation curves change values over time
– and drive the Joint’s attributes (rotation, translation, etc.)
How is each Joint represented?

 Deformers, which compute the final 3d vertices of a
character’s skin, need a “Frame” of reference to apply
offsets from.
 The “World Space” Position and Orientation of the Joints
from the Hierarchy (skeleton) provide that “Frame” of
reference.
How does the skeleton influence the
skin?

Representing a “Frame” of reference
struct Matrix4x4
{
double m[4][4];
};
 A 4x4 Matrix can represent the Position and Orientation of a
Joint in World Space.
 When used in this manner, the 4x4 Matrix is commonly
referred to as a 3d transform (x-form).
 A 4x4 Matrix is typically implemented literally as a 4x4 array of
floating point values.

 Rotation, Scale, Shear, and Translation can all be
represented as 4x4 Matrices.
 Multiple 4x4 Matrices can be concatenated (multiplied)
together to a single 4x4 matrix.
 3d points and 3d vectors (offsets) can be multiplied through
a 4x4 Matrix to be transformed to the position and
orientation in “World Space” it represents.
 For each Joint
– matrices representing Scale, Shear, Rotation, and Translation are
combined together into a single “Local Space” 4x4 matrix.
Why a 4x4 Matrix?

 By recursively combining the “Local Space” transforms of a Joint
with its parent Joint’s “Local Space” until the root of the hierarchy
is reached, a 4x4 matrix can be accumulated that represents the
World Space of that Joint.
 As there are many joints, it pays off to cache a “World Space” 4x4
Matrix at each joint, so that a recursive walk up the hierarchy can
stop early if a clean “World Space” has been cached.
How To Calculate The World Space
Transform Of A Joint?
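The cached recursive walk described above can be sketched as follows. This is a minimal illustration, not DWA's Hierarchy code: the `Joint` record, its field names, `worldSpace`, and `makeTranslation` are all hypothetical, and the row-vector convention (child local transform times parent world transform) is assumed from the matrices shown elsewhere in the deck.

```cpp
#include <cstddef>
#include <vector>

// Plain 4x4 matrix, stored as a 4x4 array of doubles.
struct Matrix4x4 { double m[4][4]; };

// General 4x4 multiply (row-vector convention: result = a * b).
inline Matrix4x4 multiply(const Matrix4x4& a, const Matrix4x4& b) {
    Matrix4x4 r{};  // zero-initialized accumulator
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Hypothetical joint record; the names are illustrative only.
struct Joint {
    int parent;        // index of the parent joint, -1 for the root
    Matrix4x4 local;   // "Local Space" transform relative to the parent
    Matrix4x4 world;   // cached "World Space" transform
    bool worldClean;   // true while the cached world transform is valid
};

// Walk up the hierarchy, stopping at the first clean cached world transform.
inline const Matrix4x4& worldSpace(std::vector<Joint>& joints, int index) {
    Joint& j = joints[static_cast<std::size_t>(index)];
    if (!j.worldClean) {
        j.world = (j.parent < 0)
                      ? j.local
                      : multiply(j.local, worldSpace(joints, j.parent));
        j.worldClean = true;
    }
    return j.world;
}

// Helper for examples: translation offsets live in the last row.
inline Matrix4x4 makeTranslation(double x, double y, double z) {
    return Matrix4x4{{{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {x, y, z, 1}}};
}
```

A real system would also have to clear `worldClean` on a joint and all of its descendants whenever one of its attributes changes; that invalidation step is left out of this sketch.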

 Each time step, 1000’s of Joint attributes change,
invalidating a Hierarchy’s cached World Space and
Local Space transforms.
 1000’s of operations on Hierarchy objects build up a
complex skeleton.
Hierarchy is the core of
DWA’s Motion System
 Imagine how many bones are used to
represent a 4 legged creature with a
tail & wings.
 Due to the recursion, there is little
opportunity for data vectorization or
threading.

 Despite heavy parallelization of the Deformation System (green & yellow), it
can’t start until the Motion System (red) finishes assembling a Hierarchy.
Motion System Is On The Critical Path

 Motion System dwarfs the
other systems.
 Amdahl’s law limits our
threading & vectorization
improvements in the
deformation system from
having a larger overall
impact.
Wall Time Spent in Each Category
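For reference, Amdahl's law can be written as below. The symbols are introduced here for illustration and do not come from the slides: p is the fraction of wall time that an optimization applies to, and s is the speedup achieved on that fraction.

```latex
S_{\text{overall}} = \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
\lim_{s \to \infty} S_{\text{overall}} = \frac{1}{1 - p}
```

With a purely illustrative p = 0.5, even an infinitely fast deformation system could not make a frame more than 2x faster; the time left on the critical path sets the ceiling.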

 “hier_apply_fk_around_pivot”
is the hottest operator
– Operates on a Hierarchy
– Verified in Intel® VTune™
Amplifier XE
 Several other “hier” related
operations take up other
top hot spots.
Time Spent inside each type of Operator

 Typical implementation
– Loop over rows
– Loop over columns
– Compute result element by
multiplying one row of first matrix
across one column of the other
 Simple enough, but how much
work did we really just do?
struct Matrix4x4
{
double m[4][4];
};
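Matrix4x4 operator * (const Matrix4x4 &iOther) const
{
    Matrix4x4 result;
    for (int r = 0; r < 4; ++r)          // each row of the result
    {
        for (int c = 0; c < 4; ++c)      // each column of the result
        {
            // Dot product of row r of this matrix with column c of iOther.
            double sum = 0.0;
            for (int k = 0; k < 4; ++k)
            {
                sum += m[r][k] * iOther.m[k][c];
            }
            result.m[r][c] = sum;
        }
    }
    return result;
}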
Matrix Concatenation (Multiplication)

 64 Multiplies (double precision): 16 result elements x 4 products each
 48 Additions (double precision): 16 result elements x 3 sums each
Expensive Matrix Concatenation
Matrix4x4 operator * (const Matrix4x4 &iOther) const
{
Matrix4x4 result;
result.m[0][0] =
m[0][0]*iOther.m[0][0] +
m[0][1]*iOther.m[1][0] +
m[0][2]*iOther.m[2][0] +
m[0][3]*iOther.m[3][0];
result.m[0][1] =
m[0][0]*iOther.m[0][1] +
m[0][1]*iOther.m[1][1] +
m[0][2]*iOther.m[2][1] +
m[0][3]*iOther.m[3][1];
result.m[0][2] =
m[0][0]*iOther.m[0][2] +
m[0][1]*iOther.m[1][2] +
m[0][2]*iOther.m[2][2] +
m[0][3]*iOther.m[3][2];
result.m[0][3] =
m[0][0]*iOther.m[0][3] +
m[0][1]*iOther.m[1][3] +
m[0][2]*iOther.m[2][3] +
m[0][3]*iOther.m[3][3];
result.m[1][0] =
m[1][0]*iOther.m[0][0] +
m[1][1]*iOther.m[1][0] +
m[1][2]*iOther.m[2][0] +
m[1][3]*iOther.m[3][0];
result.m[1][1] =
m[1][0]*iOther.m[0][1] +
m[1][1]*iOther.m[1][1] +
m[1][2]*iOther.m[2][1] +
m[1][3]*iOther.m[3][1];
result.m[1][2] =
m[1][0]*iOther.m[0][2] +
m[1][1]*iOther.m[1][2] +
m[1][2]*iOther.m[2][2] +
m[1][3]*iOther.m[3][2];
result.m[1][3] =
m[1][0]*iOther.m[0][3] +
m[1][1]*iOther.m[1][3] +
m[1][2]*iOther.m[2][3] +
m[1][3]*iOther.m[3][3];
result.m[2][0] =
m[2][0]*iOther.m[0][0] +
m[2][1]*iOther.m[1][0] +
m[2][2]*iOther.m[2][0] +
m[2][3]*iOther.m[3][0];
result.m[2][1] =
m[2][0]*iOther.m[0][1] +
m[2][1]*iOther.m[1][1] +
m[2][2]*iOther.m[2][1] +
m[2][3]*iOther.m[3][1];
result.m[2][2] =
m[2][0]*iOther.m[0][2] +
m[2][1]*iOther.m[1][2] +
m[2][2]*iOther.m[2][2] +
m[2][3]*iOther.m[3][2];
result.m[2][3] =
m[2][0]*iOther.m[0][3] +
m[2][1]*iOther.m[1][3] +
m[2][2]*iOther.m[2][3] +
m[2][3]*iOther.m[3][3];
result.m[3][0] =
m[3][0]*iOther.m[0][0] +
m[3][1]*iOther.m[1][0] +
m[3][2]*iOther.m[2][0] +
m[3][3]*iOther.m[3][0];
result.m[3][1] =
m[3][0]*iOther.m[0][1] +
m[3][1]*iOther.m[1][1] +
m[3][2]*iOther.m[2][1] +
m[3][3]*iOther.m[3][1];
result.m[3][2] =
m[3][0]*iOther.m[0][2] +
m[3][1]*iOther.m[1][2] +
m[3][2]*iOther.m[2][2] +
m[3][3]*iOther.m[3][2];
result.m[3][3] =
m[3][0]*iOther.m[0][3] +
m[3][1]*iOther.m[1][3] +
m[3][2]*iOther.m[2][3] +
m[3][3]*iOther.m[3][3];
return result;
}

 Good news! YES!
 If you knew the exact transform a 4x4 matrix was
representing, you would know quite a few 0 and 1
values at compile time.
Are Any of Those 16 Matrix Values Known
At Compile Time?
Identity
[1][0][0][0]
[0][1][0][0]
[0][0][1][0]
[0][0][0][1]
Translation(x,y,z)
[1][0][0][0]
[0][1][0][0]
[0][0][1][0]
[x][y][z][1]
Shear(x,y,z)
[1][0][0][0]
[x][1][0][0]
[y][z][1][0]
[0][0][0][1]
Scale(x,y,z)
[x][0][0][0]
[0][y][0][0]
[0][0][z][0]
[0][0][0][1]
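To make the payoff of those known 0 and 1 values concrete, here is a small sketch comparing a general multiply against a hand-specialized Scale times Translation. The function names are hypothetical and this is not XBB code; it only illustrates that, once both operand types are known, every product term in this particular case has a 0 or 1 factor, so the result can be written down without any arithmetic.

```cpp
struct Matrix4x4 { double m[4][4]; };

// General multiply: 64 multiplies and 48 additions, whatever the inputs are.
inline Matrix4x4 multiplyGeneral(const Matrix4x4& a, const Matrix4x4& b) {
    Matrix4x4 r{};  // zero-initialized accumulator
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Scale(x,y,z) laid out as on the slide.
inline Matrix4x4 makeScale(double x, double y, double z) {
    return Matrix4x4{{{x, 0, 0, 0}, {0, y, 0, 0}, {0, 0, z, 0}, {0, 0, 0, 1}}};
}

// Translation(x,y,z) laid out as on the slide (offsets in the last row).
inline Matrix4x4 makeTranslation(double x, double y, double z) {
    return Matrix4x4{{{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {x, y, z, 1}}};
}

// Scale * Translation with both operand types known in advance:
// row r of Scale has a single non-zero entry, and the first three rows of
// Translation are identity rows, so no multiply or add is needed at all.
inline Matrix4x4 scaleTimesTranslation(double sx, double sy, double sz,
                                       double tx, double ty, double tz) {
    return Matrix4x4{{{sx, 0, 0, 0}, {0, sy, 0, 0}, {0, 0, sz, 0}, {tx, ty, tz, 1}}};
}
```

The specialized version produces the same 16 values as `multiplyGeneral(makeScale(...), makeTranslation(...))` while performing zero floating point operations.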

 Building rotation matrices is more expensive because of the need
to call sine and cosine on the angle
 Rotations also have 0 and 1 values
What About Rotations?
Rotate X axis(angle)
[1][0][0][0]
[0][c][s][0]
[0][-s][c][0]
[0][0][0][1]
Rotate Y axis(angle)
[c][0][-s][0]
[0][1][0][0]
[s][0][c][0]
[0][0][0][1]
Rotate Z axis(angle)
[c][s][0][0]
[-s][c][0][0]
[0][0][1][0]
[0][0][0][1]
let s = sine(angle)
let c = cosine(angle)

 Unfortunately, the matrix multiply method doesn’t
know that the 4x4 Matrix it was passed has any 0 or 1
values
– So it cannot avoid performing the math operations.
 Even if we had separate classes to represent the
different transformations and multiple versions of the
matrix multiply method for each
– The result becomes a general 4x4 matrix.
– Chains of multiplication would only benefit on the 1st multiply
operation
Huge Optimization Potential!

 Pseudo algorithm to compute a Joint’s World Space
– 10 4x4 matrix multiplications
– 1 matrix inversion (very expensive) in the middle
 YES… But you won’t even want to try
 Good luck getting the expanded math right
Can we expand the math by hand?
JointWorldSpace = Scale*Shear*
ParentScale*ParentShear*
RotZ*RotY*RotX*
((ParentScale*ParentShear).inverse())*
Translate*
ParentWorldSpace;

 Must keep high level representation of algorithm
 Perform the absolute minimum required number of
math operations
– It must track known values
– Continue tracking values through matrix multiplications
 Utilize known information to provide a cheaper
alternative to full matrix inversions
 Interface/Adapt to existing 4x4 Matrix data types
Ideal Transform Behavior

 C++ library to enable composition of 3d transforms
 Instead of a general purpose 4x4 matrix, it provides
specific types for different transforms.
 Tracks known values through multiplication chains
 Deferred Evaluation
 Only localized source code changes are required to take
advantage of it
Introducing Xform Building Blocks (XBB)

XBB Scale, Shear3, & Translation
 Before
ref::Matrix4x4 S;
S.makeScale(scaleX, scaleY, scaleZ);
ref::Matrix4x4 SH;
SH.makeShear3(shearX, shearY, shearZ);
ref::Matrix4x4 T;
T.makeTranslation(transX, transY, transZ);
– 128 Bytes of Stack used per 4x4 Matrix
– Overhead to initialize to Identity(), then overwrite elements
 After XBB
xbb::Scale S(scaleX, scaleY, scaleZ);
xbb::Shear3 SH(shearX, shearY, shearZ);
xbb::Translation T(transX, transY, transZ);
– 24 Bytes of Stack per transform
– No overhead to initialize 4x4 elements that are known to be 0 or 1
for each type of transform

XBB Transform Representation
struct Translation
{
double x;
double y;
double z;
…
};
 Stores only non-constant data
needed to represent a 4x4 matrix of
the transform type
 Provides methods for element level
access to a 4x4 matrix
– Return known constant values
double e10() const { return 0.0; }
double e30() const { return x; }
double e31() const { return y; }
double e32() const { return z; }
Translation(x,y,z)
[1][0][0][0]
[0][1][0][0]
[0][0][1][0]
[x][y][z][1]
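Filling in the elided part of the struct, a compact translation type along these lines might look like the sketch below. Only `x`, `y`, `z` and the `eRC` accessor naming come from the slide; the full set of sixteen accessors, the `toMatrix` helper, and the local `Matrix4x4` type are assumptions made for illustration.

```cpp
struct Matrix4x4 { double m[4][4]; };

// Sketch of a compact translation type: only x, y, z are stored.
struct Translation {
    double x;
    double y;
    double z;

    // Element accessors named eRC (row R, column C), following the slide.
    // Known constants are returned as literals, never read from memory.
    double e00() const { return 1.0; }
    double e01() const { return 0.0; }
    double e02() const { return 0.0; }
    double e03() const { return 0.0; }
    double e10() const { return 0.0; }
    double e11() const { return 1.0; }
    double e12() const { return 0.0; }
    double e13() const { return 0.0; }
    double e20() const { return 0.0; }
    double e21() const { return 0.0; }
    double e22() const { return 1.0; }
    double e23() const { return 0.0; }
    double e30() const { return x; }
    double e31() const { return y; }
    double e32() const { return z; }
    double e33() const { return 1.0; }

    // Hypothetical export to a general purpose matrix.
    Matrix4x4 toMatrix() const {
        return Matrix4x4{{{e00(), e01(), e02(), e03()},
                          {e10(), e11(), e12(), e13()},
                          {e20(), e21(), e22(), e23()},
                          {e30(), e31(), e32(), e33()}}};
    }
};

// Three doubles of storage instead of sixteen.
static_assert(sizeof(Translation) == 3 * sizeof(double),
              "Translation stores only its non-constant elements");
```

Because the accessors are trivially inlinable, a caller that reads `e10()` sees the literal `0.0` rather than a load from memory, which is what lets later stages drop work.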

XBB Transform Constancy
enum Constancy
{
ConstantZero,
ConstantOne,
NotConstant
};
 Each transform identifies if each 4x4
matrix element is a constant 0, 1, or
Not Constant
 Constancy is suitable as template
parameter
– Matrix Multiply will make use of it
static const Constancy c10 = ConstantZero;
static const Constancy c11 = ConstantOne;
static const Constancy c30 = NotConstant;
Translation(x,y,z)
[1][0][0][0]
[0][1][0][0]
[0][0][1][0]
[x][y][z][1]
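One way Constancy can act as a template parameter is sketched below. This is an assumption-laden illustration rather than XBB's implementation: the `term` helper and `translationProduct30` are hypothetical names, and plain branches on template parameters are used here where the slides describe template specialization. The branches are on compile-time constants, so each instantiation keeps only one path. The compiler generally cannot do this on its own for a runtime `0.0 * x`, because under IEEE rules that product is not always zero (for example when x is NaN or infinite).

```cpp
enum Constancy { ConstantZero, ConstantOne, NotConstant };

// Product of two matrix elements whose Constancy is known at compile time.
// Only the NotConstant x NotConstant case performs a real multiply.
template <Constancy A, Constancy B>
inline double term(double a, double b) {
    if (A == ConstantZero || B == ConstantZero) return 0.0;  // known zero
    if (A == ConstantOne) return b;                          // 1 * b
    if (B == ConstantOne) return a;                          // a * 1
    return a * b;                                            // general case
}

// Element (3,0) of Translation(x1,y1,z1) * Translation(x2,...):
//   row 3 of the left operand is    [x1][y1][z1][1]
//   column 0 of the right operand is [1][0][0][x2]
// With Constancy applied, no multiply remains and the result is x1 + x2.
inline double translationProduct30(double x1, double y1, double z1, double x2) {
    return term<NotConstant, ConstantOne>(x1, 1.0)
         + term<NotConstant, ConstantZero>(y1, 0.0)
         + term<NotConstant, ConstantZero>(z1, 0.0)
         + term<ConstantOne, NotConstant>(1.0, x2);
}
```

This sketch only removes multiplications; skipping the additions of known zeros would need the same kind of compile-time treatment for the sum.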

XBB Rotations
 Before
ref::Matrix4x4 Rx;
Rx.makeRotationX(rotX);
ref::Matrix4x4 Ry;
Ry.makeRotationY(rotY);
ref::Matrix4x4 Rz;
Rz.makeRotationZ(rotZ);
– Computes sine(angle) and cosine(angle)
– 128 Bytes of Stack used per 4x4 Matrix
– Overhead to initialize to Identity(), then overwrite elements
 After XBB
xbb::RotationX Rx(rotX);
xbb::RotationY Ry(rotY);
xbb::RotationZ Rz(rotZ);
– Computes sine(angle) and cosine(angle)
– 16 Bytes of Stack per rotation
– No overhead to initialize 4x4 elements that are known to be 0 or 1
for each type of transform

XBB Rotation Representation
struct RotationX
{
double cosineOfAngle;
double sineOfAngle;
…
};
 Stores the sine and cosine of the
angle, not the angle itself.
 Provides methods for element
level access to a 4x4 matrix
– Return known constant values
double e11() const { return cosineOfAngle; }
double e12() const { return sineOfAngle; }
double e21() const { return -sineOfAngle; }
double e22() const { return cosineOfAngle; }
Rotate X axis(angle)
[1][0][0][0]
[0][c][s][0]
[0][-s][c][0]
[0][0][0][1]

XBB Multiply
 Before
ref::Matrix4x4 SxSH;
SxSH = S*SH;
 After XBB
auto SxSH = S*SH;
xbb::Matrix4x3 SxSH_Matrix;
SxSH.to(SxSH_Matrix);
– No math is performed by S*SH. Instead, a new type
Multiply<Scale, Shear3> is returned.
– Math is deferred until you explicitly
export to a general purpose matrix.
– XBB’s Multiply uses the Constancy
of its template parameters to
define its own Constancy values
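How a product type could derive its own Constancy from its operands can be sketched with `constexpr` functions. Everything named here (`mulConstancy`, `addConstancy`, `ScaleC`, `TranslationC`, `MultiplyC`, and the `c(row, col)` lookup) is hypothetical; the slides show per-element static members such as `c10`, and this sketch uses a function instead only to stay short.

```cpp
enum Constancy { ConstantZero, ConstantOne, NotConstant };

// Constancy of a single product a*b.
constexpr Constancy mulConstancy(Constancy a, Constancy b) {
    return (a == ConstantZero || b == ConstantZero) ? ConstantZero
         : (a == ConstantOne && b == ConstantOne)   ? ConstantOne
                                                    : NotConstant;
}

// Constancy of a sum a+b (1+1 is a constant, but not 0 or 1, so NotConstant).
constexpr Constancy addConstancy(Constancy a, Constancy b) {
    return (a == ConstantZero) ? b
         : (b == ConstantZero) ? a
                               : NotConstant;
}

// Per-element Constancy of Scale: diagonal [x][y][z][1], zeros elsewhere.
struct ScaleC {
    static constexpr Constancy c(int r, int col) {
        return r != col ? ConstantZero : (r == 3 ? ConstantOne : NotConstant);
    }
};

// Per-element Constancy of Translation: identity with [x][y][z] in row 3.
struct TranslationC {
    static constexpr Constancy c(int r, int col) {
        return r == col ? ConstantOne : (r == 3 ? NotConstant : ConstantZero);
    }
};

// Constancy of element (r, col) of L*R, from the four terms of the dot product.
template <class L, class R>
struct MultiplyC {
    static constexpr Constancy c(int r, int col) {
        return addConstancy(
            addConstancy(mulConstancy(L::c(r, 0), R::c(0, col)),
                         mulConstancy(L::c(r, 1), R::c(1, col))),
            addConstancy(mulConstancy(L::c(r, 2), R::c(2, col)),
                         mulConstancy(L::c(r, 3), R::c(3, col))));
    }
};

// Scale * Translation keeps its zeros and its trailing one.
static_assert(MultiplyC<ScaleC, TranslationC>::c(0, 1) == ConstantZero, "known zero");
static_assert(MultiplyC<ScaleC, TranslationC>::c(3, 3) == ConstantOne, "known one");
static_assert(MultiplyC<ScaleC, TranslationC>::c(3, 0) == NotConstant, "runtime value");
```

Because `MultiplyC` exposes the same `c(row, col)` interface as its operands, it can itself be an operand of another `MultiplyC`, which is how known values could keep being tracked through a longer chain.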

Multiplication Chains
 Before
ref::Matrix4x4 jointLocalSpace;
jointLocalSpace = S*SH*Rz*Ry*Rx*T;
– 5 matrix multiplications:
320 multiplications
240 adds
 After XBB
xbb::Matrix4x3 jointLocalSpace;
(S*SH*Rz*Ry*Rx*T).to(jointLocalSpace);
– The chain has the type
Multiply<Multiply<Multiply<Multiply<Multiply<Scale, Shear3>,
RotationZ>,
RotationY>,
RotationX>,
Translation>
– Confirmed assembly has
minimum math operations
– Speedup 2.45x

Deferred Evaluation (reduce)
typedef ReducedMatrix
<
c00, c01, c02, c03,
c10, c11, c12, c13,
c20, c21, c22, c23,
c30, c31, c32, c33
> ReducedType;
 ReducedMatrix based on a transform’s
Constancy.
– Only has data members for NotConstant matrix
elements
 Multiply’s reduce recursively expands its left
and right operands
– Expands out entire multiplication chain
 4x4 elements setByMatrixMultiply
– Actually multiplies a column by row
– Knows Constancy of the elements from reduced
left and right transforms
 Using template specialization based on the
Constancy
– Only exact terms necessary are accessed
– Emits only necessary multiplications & additions
ReducedType Multiply::reduce() const
{
const auto tl = left.reduce();
const auto tr = right.reduce();
ReducedType r;
r.setByMatrixMultiply<0,0>(tl,tr);
...
return r;
}
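The deferred evaluation pattern can be sketched with a small expression template. This is a simplified stand-in, not XBB: the namespace `sketch` and all types in it are hypothetical, and `reduce()` here produces a full 4x4 matrix with a general multiply instead of a Constancy-based ReducedMatrix, so it shows only the deferral and the recursive expansion of the chain, not the elimination of constant terms.

```cpp
namespace sketch {

struct Matrix4x4 { double m[4][4]; };

inline Matrix4x4 multiplyGeneral(const Matrix4x4& a, const Matrix4x4& b) {
    Matrix4x4 r{};  // zero-initialized accumulator
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

// Leaf transforms store only their non-constant values.
struct Scale {
    double x, y, z;
    Matrix4x4 reduce() const {
        return Matrix4x4{{{x, 0, 0, 0}, {0, y, 0, 0}, {0, 0, z, 0}, {0, 0, 0, 1}}};
    }
};

struct Translation {
    double x, y, z;
    Matrix4x4 reduce() const {
        return Matrix4x4{{{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {x, y, z, 1}}};
    }
};

// Deferred product: constructing it copies the operands and does no math.
template <class L, class R>
struct Multiply {
    L left;
    R right;

    // Recursively reduce both operands, then combine them.
    Matrix4x4 reduce() const {
        const Matrix4x4 tl = left.reduce();
        const Matrix4x4 tr = right.reduce();
        return multiplyGeneral(tl, tr);
    }

    // Math happens only when exporting to a general purpose matrix.
    void to(Matrix4x4& out) const { out = reduce(); }
};

// Found by argument-dependent lookup for types declared in this namespace.
template <class L, class R>
Multiply<L, R> operator*(const L& l, const R& r) {
    return Multiply<L, R>{l, r};
}

}  // namespace sketch
```

For example, `auto chain = S * T * S;` with `sketch::Scale S` and `sketch::Translation T` has the type `Multiply<Multiply<Scale, Translation>, Scale>`, and nothing is computed until `chain.to(out)` is called.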

 Many Hierarchy operations change only Translation of a Joint.
– If we could cache the Rotation transforms, then many expensive
sin/cos calls could be avoided.
– Matrix4x4 is too big (128 bytes) to cache one for each Rotation X, Y,
and Z.
 XBB rotations are only 16 bytes each
– Small enough to cache inside the Joint object
XBB: Cached Rotations
(S*SH*cached.Rz*cached.Ry*cached.Rx*T).to(jointLocalSpace);
Use Cached Sin/Cos of Angles
Speedup 12.71x
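A joint that caches its rotations might be organized as in the sketch below. The `Joint` layout, method names, and the `trigCalls` counter are hypothetical and exist only to make the effect visible; a single `Rotation` struct is used for all three axes here, whereas XBB has distinct RotationX, RotationY, and RotationZ types.

```cpp
#include <cmath>

// 16-byte rotation: stores the sine and cosine, not the angle itself.
struct Rotation {
    double cosineOfAngle;
    double sineOfAngle;
};

// Hypothetical joint that keeps its three rotations cached (48 bytes total).
class Joint {
public:
    Rotation rx, ry, rz;   // cached sin/cos for the X, Y, and Z rotations
    double tx, ty, tz;     // translation components
    int trigCalls;         // instrumentation for this sketch only

    Joint()
        : rx{1.0, 0.0}, ry{1.0, 0.0}, rz{1.0, 0.0},
          tx(0.0), ty(0.0), tz(0.0), trigCalls(0) {}

    // Changing an angle is the only operation that pays for sin/cos.
    void setRotation(double angleX, double angleY, double angleZ) {
        rx = makeRotation(angleX);
        ry = makeRotation(angleY);
        rz = makeRotation(angleZ);
    }

    // Changing the translation leaves the cached rotations untouched.
    void setTranslation(double x, double y, double z) {
        tx = x;
        ty = y;
        tz = z;
    }

private:
    Rotation makeRotation(double angle) {
        trigCalls += 2;  // one cosine and one sine
        return Rotation{std::cos(angle), std::sin(angle)};
    }
};
```

After one `setRotation` call, any number of `setTranslation` calls leave `trigCalls` at 6, which is the saving the slide attributes to caching.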

 Identity is free in any multiplication chain
– Optimized out entirely
– Only 1 byte of stack space (empty struct)
 Transpose is free in any multiplication chain
– Deferred evaluation pulls results out in different order
– No additional math or data movement
XBB Identity & Transpose
Identity id;
(S*SH*id*R*T).to(result);
(S*SH*R*T).transpose().to(result);

 Inverse is very expensive
– Determinant
– Cofactor
– Transpose
– Division
– scalar matrix multiply
Before: Inverse of (Scale*Shear)
inverseOfSxSH = (S*SH).inverse();

(S*SH).inverse().to(inverseOfSxSH);
 MAGIC happens
– Inverse becomes part of deferred evaluation!
 Because we have a representation of the multiplication chain
– we can move the inverse inside the multiplication chain and reverse its order
 Inverse of most transform primitives is free
– except Scale which costs 3 divisions
 During deferred evaluation
– the logical 4x4 matrix values are reordered and have their signs flipped where needed to
represent the inverse
(SH.inverse()*S.inverse()).to(inverseOfSxSH);
Speedup 6.43x
After XBB: Inverse of (Scale*Shear)
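The rule being used is that the inverse of a product is the product of the inverses in reverse order. The sketch below demonstrates it with Scale and Translation instead of Scale and Shear3, purely to keep the example short; the types and `inverseOfScaleTimesTranslation` are hypothetical, and the final multiply is done with a general routine where XBB would fold it into deferred evaluation.

```cpp
struct Matrix4x4 { double m[4][4]; };

inline Matrix4x4 multiplyGeneral(const Matrix4x4& a, const Matrix4x4& b) {
    Matrix4x4 r{};  // zero-initialized accumulator
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k)
                r.m[i][j] += a.m[i][k] * b.m[k][j];
    return r;
}

struct Scale {
    double x, y, z;
    // Inverse of a scale is another scale: three divisions, nothing else.
    Scale inverse() const { return Scale{1.0 / x, 1.0 / y, 1.0 / z}; }
    Matrix4x4 toMatrix() const {
        return Matrix4x4{{{x, 0, 0, 0}, {0, y, 0, 0}, {0, 0, z, 0}, {0, 0, 0, 1}}};
    }
};

struct Translation {
    double x, y, z;
    // Inverse of a translation only flips signs.
    Translation inverse() const { return Translation{-x, -y, -z}; }
    Matrix4x4 toMatrix() const {
        return Matrix4x4{{{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {x, y, z, 1}}};
    }
};

// (S*T)^-1 == T^-1 * S^-1: no determinant, cofactors, or general inversion.
inline Matrix4x4 inverseOfScaleTimesTranslation(const Scale& s, const Translation& t) {
    return multiplyGeneral(t.inverse().toMatrix(), s.inverse().toMatrix());
}
```

Multiplying `S*T` by the matrix returned from `inverseOfScaleTimesTranslation(S, T)` gives the identity, which is an easy way to check the reordering.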

 Provide template specializations for adapters to map between DWA
math classes and XBB’s.
– Allows XBB deferred evaluation directly into DWA matrix types
 In many scenarios, the transforms could have been Identity based on
logic inside the Joint.
– To take full advantage of XBB, we needed to know the exact types of the transforms
involved.
 Templatized the Hierarchy algorithm so that conditional logic is controlled
by template parameters, e.g.
– Order of Rotations
– Scale Propagation Mode
 Specialized templates based on parameters to
– Use the correct type of XBB transform
 Identity whenever possible
– Multiply the Rotations in the correct order
XBB Integration to DWA Motion System

 Built a jump table with instances of the algorithm for all the
different combinations of options and rotation orders.
– Used enums as indexes into multi-dimensional array of function
pointers to the corresponding algorithm instance to execute.
 Used XBB for decomposing World Space Matrix4x4 into individual
Joint attributes.
 Rewrote expensive “hier_apply_fk_around_pivot” with XBB directly
vs. going through Hierarchy object
– Avoid high overhead of building Hierarchy on the fly
 Performed non XBB related optimizations
– Reduced dynamic memory allocation by replacing local std::vector<T>
with stack based array when possible
XBB Integration to DWA Motion System
(continued…)
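The jump table idea can be sketched as follows. The enums, their values, and the arithmetic inside `evaluate` are placeholders invented for this example; only the structure (one template instance per option combination, selected at runtime through a multi-dimensional array of function pointers indexed by enums) reflects what the slide describes.

```cpp
// Hypothetical option enums; the trailing Count entries size the table.
enum RotationOrder { OrderXYZ, OrderZYX, RotationOrderCount };
enum ScaleMode { ScaleOff, ScaleOn, ScaleModeCount };

// Stand-in for a templatized Hierarchy algorithm. The branches depend only
// on template parameters, so each instantiation contains a single path.
template <RotationOrder Order, ScaleMode Mode>
double evaluate(double rx, double rz, double scale) {
    // Placeholder math: an order-dependent combination of the two inputs.
    const double rotated = (Order == OrderXYZ) ? (rx * 2.0 + rz)
                                               : (rz * 2.0 + rx);
    return (Mode == ScaleOn) ? rotated * scale : rotated;
}

typedef double (*EvaluateFn)(double, double, double);

// One algorithm instance per combination of options.
static const EvaluateFn kJumpTable[RotationOrderCount][ScaleModeCount] = {
    { &evaluate<OrderXYZ, ScaleOff>, &evaluate<OrderXYZ, ScaleOn> },
    { &evaluate<OrderZYX, ScaleOff>, &evaluate<OrderZYX, ScaleOn> },
};

// Runtime options select the precompiled instance with two array lookups.
inline double dispatch(RotationOrder order, ScaleMode mode,
                       double rx, double rz, double scale) {
    return kJumpTable[order][mode](rx, rz, scale);
}
```

The runtime cost of choosing among the variants is one indirect call, while the body of each variant is free of option checks.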

XBB DWA Motion System Results
[Figure: Before vs. After comparison. hier_apply_fk_around_pivot speedup 2.8x; Motion System speedup 1.6x; overall speedup 1.2x]

 Reducing the Critical Path helped Thread Scaling.
XBB DWA Motion System Scaling
Reached goal of 30 fps
on single Avoton cartridge

 A good way to improve the impact of vectorization or
threading is to reduce the amount of work being done
outside those data parallel regions.
– Ideally do less work in the first place.
 Complex optimization problems can be represented in C++
and presented back to the compiler in a form it can excel at
optimizing.
– Expanding math by hand is untenable.
 You can do much more with C++11/14 to encapsulate
problems while retaining the original high level algorithm
– Look for optimization problems that might be representable at a
higher level.
Call to Action

 XBB has exactly the features required to support the DWA
Motion System.
 For general purpose use
– more transformations and math operations might be required. e.g.
 Inverse of general 4x4 matrix
 Single precision version or template based data type
 XBB can be licensed or potentially open sourced upon
request.
– Could be of use to CAD, Animation Tools, and Gaming.
 Contact Alex Wells (alex.m.wells@intel.com)
Future Work
