7. Using BI Development Studio
• BI Development Studio (BIDS), or SQL Server Data Tools
(SSDT) in 2012, is a Visual Studio add-in for designing SSAS
databases
• You can use it in two ways:
– Project mode, where the SSAS database is held locally and
has to be specifically deployed to the server
– Online mode where you work live on the server and
deployment takes place every time you Save
• Project mode is best because:
– You can use source control
– If you make a mistake and save it, you don’t run the risk
of leaving the cube unprocessed on the server
8. Configuring a New SSAS Project
• On a new SSAS project in BIDS, always set the
following properties:
– Processing Option should be set to Do Not Process
– Server should be changed from localhost to the real
name of your SSAS server
– Database should be set to the name of the SSAS
database you want to create
– Deployment Server Edition should be set to the name
of the SQL Server edition you plan to use in
production
9. Creating a Data Source
• A Data Source represents a connection to the cube’s
relational source data
• You need to choose which provider to use
– In general native OLEDB is faster than .NET
• You need to choose what security is needed for SSAS
to connect to the data source
– Supply username/password, eg for Oracle
– Windows Authentication
10. Creating a Data Source
• You need to choose which Windows user SSAS will
impersonate to connect to the cube
– Easiest to use the SSAS service account
– Most secure to supply a username/password
• You can have multiple Data Sources in a project, but it
is a bad idea
– Your data should already be in one data mart!
– Multiple data sources force SSAS to use SQL Server
OpenRowset queries – slow
11. Creating a Data Source View
• A Data Source View (DSV) is simply an intermediate
stage where you can
– Pick the tables you wish to use to build your cube
– Define logical relationships between tables (note: they
are not written back to the RDBMS)
– Rename tables and columns
– Tweak the data so it is modelled the way you want
• DSVs are not visible to end users
12. Named Calculations and Named
Queries
• Named Calculations allow you to create derived
columns in tables
– Need a SQL expression that returns a value
• Named Queries allow you to create the equivalent of
views in the DSV
– Need a SQL SELECT statement
13. Dos and Don’ts of DSVs
• DO add in logical foreign key relationships between
tables, even if they are not physically present in your
database
– Means BIDS will automatically set other properties for you
later on in the design process
• DO create multiple diagrams to reduce complexity
• DO NOT use them as a replacement for proper ETL!
– It will slow down SSAS processing
– It hides important aspects of your design
– Use relational views if you must, or even better change
your SSIS packages
14. BIDS Helper
• BIDS Helper is a free tool that adds much important
functionality to BIDS
– Better attribute relationship visualisation
– Better aggregation designer
– Features for checking common problems
• Install it!
http://bidshelper.codeplex.com/
17. Dimensions and Attributes
• An SSAS Dimension is composed of multiple
Attributes
• Attributes represent the different levels of
granularity on a dimension that data can be
aggregated up to
– On a Time dimension you might have attributes like
Year and Month
– Based on the columns in the dimension table
• By default, each attribute becomes a hierarchy you
can use when browsing the cube
18. Dimension Properties
• ProcessingGroup – using ByTable results in a single
query being issued to process the whole dimension
– Might improve performance but not usually a good
idea!
– Also need to turn off Duplicate Key errors
• StringStoresCompatibilityLevel (new in 2012) - avoids
4GB string store limit
– May slow performance and increase size on disk
19. Attribute Properties
• AttributeHierarchyEnabled controls whether an attribute
hierarchy will be built
– If one isn’t built, the attribute is visible only as a member
property of another attribute
– Useful for things like phone numbers and addresses
– Not building a hierarchy can save time when processing
the dimension
• AttributeHierarchyOptimizedState controls whether
indexes are built
– Can save time processing the dimension
– But can have a big impact on query performance
20. Attribute Properties
• DefaultMember controls what the default selected
member is
– If you don’t set it, the default member will be the first real
member on the first level
– So usually the All Member
– Setting it can confuse users
• IsAggregatable controls whether an All Member is
created for a hierarchy
– Some hierarchies contain data that should never be
aggregated, eg different budgets
– If this is set to False, always set a specific default member
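• For illustration, a default member can also be set from the cube’s
MDX Script; a minimal sketch assuming a hypothetical Scenario dimension:
ALTER CUBE CURRENTCUBE
UPDATE DIMENSION [Scenario].[Scenario Name],
DEFAULT_MEMBER = [Scenario].[Scenario Name].&[Actual];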
21. Ordering Attributes
• The OrderBy property controls the order of members
on a hierarchy
– You can order by name, key, or the value of another
attribute
• The AttributeHierarchyOrdered property controls
whether the attribute is ordered at all
– Not ordering it can result in faster processing
– But means that member ordering is arbitrary
22. Time Dimensions
• Setting the Type property of a dimension to ‘Time’
enables some time-specific features:
– Certain MDX functions, like YTD, are time-aware
– Semi-additive measures need a time dimension
• You will also need to set the Type property on
attributes such as years, quarters, months and dates
for everything to work properly
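• A minimal sketch of a YTD calculation that relies on those Type
properties being set (dimension, hierarchy and measure names are hypothetical):
WITH MEMBER Measures.[Sales YTD] AS
SUM(YTD([Date].[Calendar].CurrentMember), Measures.[Sales Amount])
SELECT Measures.[Sales YTD] ON 0,
[Date].[Calendar].[Month].MEMBERS ON 1
FROM [Sales Cube]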
23. Attributes and User Hierarchies
• User hierarchies are multi-level hierarchies that
support more complex drill paths
• Each level in a user hierarchy is based on an attribute
– On a Time dimension there might be a Calendar
Hierarchy with levels Year, Quarter, Month, Date
• If you put an attribute hierarchy in a user hierarchy,
then set its Visible property to False
24. Attribute or User Hierarchies?
• As always, it depends...
• User hierarchies:
– Are easier for users to understand
– Make writing calculations where you need to find
ancestors/descendants easier
– May lead to better query performance overall
• Attribute hierarchies:
– Are more flexible – with a user hierarchy you can’t put two
levels of the same hierarchy on different axes
25. Attribute Relationships
• Attribute Relationships describe the one-to-many
relationships between attributes
– For example a Year contains many Months and a Month
contains many Dates
• All attributes must have a one-to-many relationship with
the Key attribute
• Natural user hierarchies are user hierarchies where every
level has a one-to-many relationship with the one below
• The BIDS Helper Dimension Health Check functionality
can check whether attribute relationships have been
correctly configured
26. Optimal Attribute Relationships
• Optimising attribute relationships is very important
for query performance
• In general, query performance will be better with
‘bushy’ rather than ‘flat’ attribute relationships
• Tweaking your data to allow the creation of chains of
relationships is a good thing
– So long as it doesn’t compromise your data
• Also consider if merging many dimensions into one
will allow you to build better relationships
27. Attribute Relationship Scenarios
• Forks versus Chains
• Creating dummy attributes to make unnatural
hierarchies natural
• Type 2 Slowly Changing Dimensions
• Diamond-shaped relationships
28. Slowly Changing Dimensions
• Analysis Services has no specific support for SCDs,
but can handle them very easily
• Type 1 SCD changes require a Process Update
– Aggregations may also need to be rebuilt
• Type 2 SCD changes require a Process Add or a
Process Update
• Modelling attribute relationships correctly on a Type
2 SCD can make a big difference to query performance
29. Rigid and Flexible Relationships
• Attribute relationships can be rigid or flexible
• Rigid relationships are relationships that will never
undergo a Type 2 change
• Flexible relationships may undergo a Type 2 change
• Any aggregation that includes an attribute with a
flexible relationship between it and the key attribute
will be dropped after a Process Update on a
dimension
30. Parent/Child Hierarchies
• Parent/child hierarchies can be built on tables which
have a recursive or “pig’s ear” join
• Allow you to build variable-depth hierarchies such as Org
Charts or Charts of Accounts
• Are very flexible, but have several limitations:
– You can only have one per dimension, and it must be built
from the key attribute
– They can cause serious query performance problems,
especially when they contain more than a few thousand
members
• Avoid using them unless you have no other choice!
31. Members with Data
• Parent/child hierarchies can be used to load non-leaf
data into a cube
• Useful for data that is non-aggregatable
– Eg Data that has been anonymised
• However, it will perform relatively badly
• Each member on a parent/child hierarchy has an
associated DataMember
– The non-leaf data is associated with the DataMember
– The DataMember can be visible or hidden
– By default, the DataMember’s value will be aggregated
along with the other children’s values, but this can be
overridden with MDX
32. Ragged Hierarchies
• In some cases, you can make a normal user hierarchy
look ‘ragged’ by setting the HideMemberIf property
• This allows members to be hidden in a user hierarchy
when they have
– No name
– The same name as their parent
– Either of the above plus they are the only child
• Useful when you know in advance the maximum
number of levels in your hierarchy
33. Ragged Hierarchies
• Have to set the connection string property MDX
Compatibility=2 to make them appear ragged in all
cases
– Most client tools do this by default; Excel and the
Dimension Editor do not
• To make Excel work, duplicate values from the
bottom of the hierarchy upwards
35. Cubes and Measure Groups
• A Cube should contain all of the data that any user
would want to query
– You can’t query more than one cube at a time
• Each Fact Table in your data mart will become a
Measure Group in your cube
• Usually better to create one cube with multiple
measure groups than multiple cubes
• You can use Perspectives to simplify what end users
see
36. When To Use Multiple Cubes
• A single cube has some disadvantages, and multiple cubes some advantages:
– A single cube can become too complex to maintain
and develop
– If a dimension needs a full process, then a single cube
needs completely reprocessing
– Multiple cubes can be spread over multiple servers for
scale-out
– Multiple cubes are easier to secure
37. Measure Aggregation
• Basic measure aggregation types are Sum, Count
(either rows in the fact table, or non-null column
values), Min and Max
– Average can be written with some simple MDX (see the sketch below)
• Enterprise Edition gives several more complex
aggregation functions
• Custom aggregation can be achieved with MDX
calculations
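• A minimal sketch of the Average-via-MDX idea (measure names are hypothetical):
CREATE MEMBER CURRENTCUBE.Measures.[Average Sale Value] AS
IIF(Measures.[Sales Count] = 0, NULL,
Measures.[Sales Amount] / Measures.[Sales Count]);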
38. Distinct Count
• Distinct Count measures will be created in their own
measure group
– This is for performance reasons
• They are likely to perform worse than other
measures
– Require a special partitioning strategy
– Very IO intensive – solid state disks can be useful
39. Semi-Additive Measures
• Semi-additive measures act like Sum for all
dimensions except the Time dimension
– NB the Time dimension needs to have its Type
property set to ‘Time’
– If there are multiple Time dimensions, the measures
will only be semi-additive for one
• They are: FirstChild, LastChild, FirstNonEmpty,
LastNonEmpty, AverageOfChildren
• Useful for handling snapshot fact data
40. Other Aggregation Types
• None means that data only appears at the
intersection of the leaves of all dimensions
• ByAccount allows you to create measures which have
different semi-additive behaviour for different
members
– Again, useful for financial applications
41. Regular Relationships
• The Dimension Usage tab in BIDS shows how
Dimensions join to Measure Groups
• No Relationship is a valid relationship type!
• Regular relationships are the equivalent of inner
joins between dimensions and measure groups
– Most commonly used relationship type
42. Different Granularities
• The Granularity Attribute property controls which
attribute is used to make the join on a Regular
relationship
– Usually the Key Attribute is used
– Join always takes place on the columns used in the
attribute’s KeyColumns property
• The IgnoreUnrelatedDimensions property controls what
happens below granularity and when there is no
relationship between dimensions and measure groups
– Can either display nulls
– Or repeat the All Member value
43. Fact Relationships
• Very similar to Regular Relationships, but are used
when the dimension has been built directly from the
fact table
• Differences between Fact and Regular:
– Fact relationships generate slightly better SQL for
ROLAP queries
– A measure group can only have one Fact relationship
– Fact relationships may be treated differently by some
client tools
44. Referenced Relationships
• Used when a dimension connects to a measure group via
another dimension
• Materialized referenced relationships give fast query
performance
– Require a join between the reference dimension table and
the fact table at processing time, which can slow down
processing a lot
• Non-materialized referenced relationships result in
slower query performance but no overhead at processing
time
• Often better to build a single dimension instead, unless it
would result in excessive duplication of attributes
45. Many-to-Many Relationships
• Allow you to model scenarios where dimension members
have many-to-many relationships with facts
– Eg a Bank Account might be associated with several
Customers
• M2M relationships require a bridge table
– Becomes an ‘intermediate’ measure group
– Once all regular relationships have been set up, the M2M
relationship can be created
• Carry a query performance penalty
• But extremely flexible and a very important feature for
modelling many complex scenarios
46. Role-Playing Dimensions
• It is possible to add the same Database dimension to
a cube more than once
– This is called a role-playing dimension
– Eg the Date dimension could be an Order Date and a
Ship Date
• Saves time processing and on maintenance
• But since you can only change the name of the
dimension, and not the hierarchies on it, it can be
confusing for users
47. Measure Expressions
• Measure Expressions allow you to:
– Multiply or divide two measures before aggregation
takes place
– The measures have to be from two different measure
groups
– There needs to be at least one common dimension
• Useful for currency conversion or share calculations
• Faster than the equivalent MDX
48. Partitions
• Measure groups can be divided up into multiple
partitions
– Enterprise Edition only
• You should partition your measure groups to reflect the
slices that users are likely to use in their queries
– Usually by the Time dimension
• Partitioning improves query performance in two ways:
– It reduces overall IO because AS should only scan the
partitions containing the data requested
– It increases parallelism because AS can scan more than
one partition simultaneously
49. Partitioning for Regular Measures
• There are two things to consider when planning a
partitioning strategy:
– How and when new data is loaded
– How users slice their queries
• Luckily, partitioning by time usually meets both
needs
– Eg building one partition per day or month
• Can also partition by another dimension if the
resulting partitions are too large, or users regularly
slice queries in other ways
50. Partition Sizing
• How big should a partition be?
– Up to 20-50 million rows is ok, but 100 million rows can be
fine too
– Try not to create small partitions: those with less than
4096 rows will have no indexes built
• Also, try not to create too many partitions:
– BIDS and SQLMS get slower the more partitions you have
(can be very bad in older versions of SSAS)
– Aggregations are only built per partition, so over
partitioning reduces their effectiveness
– Aim for no more than 500 to 2000 partitions
51. Setting the Slice
• By default, SSAS knows the minimum and maximum
dataids of members from each attribute in a partition
• Uses this information to work out which partitions to
scan for a query
• If ranges overlap then SSAS may scan many partitions
unnecessarily
• To stop this, set the Slice property on the partition
– Takes the form of an MDX tuple
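• For example, the Slice for a partition holding only January 2008
data might be set to the following (attribute and key names are hypothetical):
([Date].[Calendar Year].&[2008], [Date].[Calendar Month].&[200801])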
52. Partitioning for Distinct Count
• Partitioning for distinct count measures must take a
completely different approach
– Remember distinct count measures should be in their
own measure group anyway
• You need to:
– Ensure each partition contains a non-overlapping
range of values in the column you’re doing a distinct
count on
– Ensure that each query hits as many partitions as
possible, so SSAS can use as many cores as possible in
parallel
53. Partitioning and M2M
• Partitioning large intermediate measure groups used
in many-to-many relationships can be useful for
performance
– Either partition by the dimensions used in the m2m
relationship
– Or add other dimensions to the relationship
• Querying by a m2m dimension does not result in a
filter being applied on intermediate dimensions
– Can result in a large number of partition scans
54. Controlling the Amount of Parallelism
• The CoordinatorExecutionMode server property
defines the number of partitions in one measure
group that can be queried in parallel
– By default it is set to -4, ie 4 * the number of processors
• CoordinatorQueryBalancingFactor and
CoordinatorQueryBoostPriorityLevel can be used to
limit the ability of a single query to use all resources
and block other queries
57. How SSAS answers queries
• An MDX query comes in and a cellset comes out
• The Formula Engine works out what data is needed for each
query and requests it from the Storage Engine via Query
Subcube requests
• The Storage Engine handles retrieval of raw data from its
cache or from disk, and any aggregation required
• Both the Formula Engine and the Storage Engine have their own caches
58. Subcubes
• Subcubes are slices of data from the cube
• Appear in many places:
– MDX Scope statements
– Caching
– Query Subcube requests
• Subcubes are defined by their granularity and slice
59. Granularity
• Single grain
– List of GROUP BY attributes in SQL SELECT
• Mixed grain
– Both Attribute.[All] and Attribute.MEMBERS
62. Granularity ordering
• (All,*) is above (*,*)
• (All,*) is below (All,All)
• (All,2008) is at (All,*)
• (All,*) and (*,All) are incompatible
• Attribute relationships make some arrows impossible:
– (State.*, All Cities) – OK
– (State.All, City.*) – cannot occur
63. Slice
• Single member
– SQL: Where City = ‘Redmond’
– MDX: [City].[Redmond]
• Multiple members
– SQL: Where City IN (‘Redmond’, ‘Seattle’)
– MDX: { [City].[Redmond], [City].[Seattle] }
64. Slice at granularity
SQL
SELECT Sum(Sales), City FROM Sales_Table
WHERE City IN (‘Redmond’, ‘Seattle’)
GROUP BY City
MDX
SELECT Measures.Sales ON 0
, NON EMPTY {Redmond, Seattle} ON 1
FROM Sales_Cube
65. Slice below granularity
SQL
SELECT Sum(Sales) FROM Sales_Table
WHERE City IN (‘Redmond’, ‘Seattle’)
MDX
SELECT Measures.Sales ON 0
FROM Sales_Cube
WHERE {Redmond, Seattle}
71. Retrieving Data
• The first step to answering a query is for the FE to
generate a query plan
– Unfortunately, we can’t see these query plans
• The next step to answering a query is to work out which
subcubes of real (ie not calculated) data are needed
– Not just for ‘real’ measures but also for calculations too
• These are translated to single-grain subcubes
• Each request from the FE to the SE is visible in Profiler as
a Query Subcube request
72. Sonar
• Sonar is the component of the FE that
determines which subcubes should be retrieved to
answer the query
• For complex queries this can involve many difficult
decisions:
– One large subcube or many small ones?
– In some cases it may be more efficient to retrieve
more data than the query requires
76. Which one is better ?
• [Diagram comparing two ways Sonar could cover the same cells:
a single non-precise subcube covering [2, *] versus a precise
subcube covering [2, {2,3,4,5,6}]]
77. Controlling Sonar
• Sometimes SSAS can decide to prefetch too much
data, causing performance problems
• To stop this happening
– Rewrite your query to use different attributes or user
hierarchies
– Use a subselect instead of a WHERE clause (though
this will impact caching)
– Use the Disable Prefetch Facts=True; Cache Ratio=1
connection string properties
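• For illustration, the earlier slicer query rewritten to use a
subselect instead of a WHERE clause:
SELECT Measures.Sales ON 0
FROM (SELECT {Redmond, Seattle} ON 0
FROM Sales_Cube)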
78. The Storage Engine
• The SE receives requests for subcubes of data from
the FE
• Then:
– If the subcube can be derived from data in cache, it
will read data from cache
– If the subcube can be derived from an aggregation, it
will read data from the aggregation
– Otherwise it will read the data from a partition
• Finally it will perform any aggregation necessary and
return the subcube
79. Reading Data from Disk
• Next, the SE determines which
partitions/aggregations contain the required slices of
data
• It then reads data from these partitions
– It will attempt to read from several partitions in
parallel if possible
• Finally merges and aggregates the data from all
partitions and loads it into the SE cache
• For distinct count measure groups things are more
complex...
80. The Formula Engine
• The FE can then read data from the SE cache and
perform calculations
• The FE tries to determine which subcubes have the same
calculations applied to them
– It can then perform calculations over large numbers of
cells very efficiently
– Known as ‘bulk mode’ or ‘block computation’
• When it cannot do this, it has to iterate over every cell,
determine which calculations need to be applied, and
perform the calculations
– This can be very slow
– Known as ‘cell by cell’ mode
81. Bulk Mode/Block Computation
• Cube space populated at fact table granularity is generally extremely sparse
– Values only exist for a small minority of possible combinations of dimension keys
– Most often, everything takes on a default value, typically (but not always) null
• Goal is to compute expressions only where they need to be computed
• Take the following example:
WITH MEMBER Measures.ContributionToParent AS
Measures.Sales /
(Measures.Sales, Product.CurrentMember.Parent)
SELECT
Product.[Product Family].MEMBERS ON COLUMNS,
Customer.Country.MEMBERS ON ROWS
FROM Sales
WHERE Measures.ContributionToParent
83. Disadvantages of Cell-by-Cell
• The same navigation for each cell repeated
– Same “relative offset” for each cell yet FE repeats the same navigation
for every cell in the subspace
• Other work repeated for every cell
– An expensive check for recursion to determine if pass should be
decremented
– Eg Sales = Sales * 1.2
• Calculating null values when we should know ahead of time
they are null
– Extremely important in a sparse value subspace
84. Bulk Mode Goals
• Only calculate non-default values for a subcube
– Usually, but not always, the default value is null
• Perform cell navigation once for each subcube
– Eg PrevMember, Parent – often the same for each cell
in a subcube
85. Bulk Mode Approach
1. Retrieve the non-null values from the Storage Engine, eg:
Country   Product          Measure   Value
USA       Drink            Sales     $24,597.00
USA       Food             Sales     $191,940.00
USA       Non-Consumable   Sales     $50,236.00
USA       All Products     Sales     $266,773.00
2. Perform the computation for the non-null values only – 3
computations instead of 9:
Country   Product          Measure                  Value
USA       Drink            Contribution to Parent   9.22%
USA       Food             Contribution to Parent   71.95%
USA       Non-Consumable   Contribution to Parent   18.83%
3. Everything else (eg Canada and Mexico) is null
86. The Formula Engine Cache
• The results of calculations are then stored in the FE
cache
– Future queries can then avoid doing calculations if the
results can be retrieved from cache
• However, FE caches may exist only for the lifetime of
a query or session
87. Using Profiler
• SQL Profiler can be used to observe what happens
within SSAS when a query executes
– MDX Studio provides much the same information,
often in a more readable form
• Provides good information on what’s happening in
the SE, much less on FE
• It’s a good idea to create a new trace template with
the following events...
88. Profiler Events
• Query Begin/End
• Execute MDX Script Begin/End
• Query Cube Begin/End
• Progress Report Begin/End
• Get Data From Aggregation
• Query Subcube Verbose (shows same data as Query
Subcube in a more readable form)
• Resource Usage
89. Interpreting Query Subcube Verbose
• Query Subcube and Query Subcube Verbose show the subcube
requests from the FE to the SE
– Single grain only
– Granularity bitmask: 1 is in grain, 0 is not in grain
• Symbols used in the verbose output:
– * = all members at granularity
– DataID = single member slice
– + = set slice at granularity
– - = set slice below granularity
– 0 = no slice and no granularity
– Arbitrary shape
90. Query Performance Benchmarking
• When performance tuning, you need to have a
benchmark to test against
– Are you improving performance?
• Methodology:
– Ensure no-one else is running queries on the server
– Ensure you’re running queries on prod data volumes and
server specs
– Clear Cache
– Run query and measure Duration, to get cold cache
performance
– Run query again to get warm cache performance
92. Agenda
• What are aggregations and why should I build them?
• The importance of good dimension design
• The Aggregation Design Wizard
• Usage-based optimisation
• Designing aggregations manually
• Influence of cube design on aggregation usage
93. What is an aggregation?
• It’s a copy of the data in your fact table, pre-aggregated to a certain level
– Created when the cube is processed
– Stored on disk
• Think of it as being similar to the results of a GROUP
BY query in SQL
• It makes queries fast because it means SSAS does not
have to aggregate as much data at query time
94. Visualising aggregations
• You can actually see an aggregation if you:
– Create a cube using ROLAP storage
– Build aggregations
– Process the cube
– Look in your relational data source at the indexed
view or table created
95. The price you pay
• Aggregations are created at processing time
• Therefore building more aggregations means
processing takes longer
• Also increases disk space used by the cube
• But SSAS can build aggregations very quickly
• And with the ‘right’ aggregation design, relatively
little extra processing time or disk space is needed
96. Should I pre-aggregate everything?
• NO!
• There is no need: you don’t need to hit an
aggregation direct to get the performance benefit
– In fact, too many aggregations can be bad for query
performance!
• Building every possible aggregation would also lead
to:
– Very, very long processing times
– Very, very large cubes (‘database explosion’)
98. When aggregations are used
• Aggregations are only useful when the Storage
Engine has to fetch data from disk
• Aggregations will not be used if the data is in the
Storage Engine cache
• Aggregations may not be useful if the cause of query
performance problems lies in the Formula Engine
99. Aggregation designs
• Each measure group can have 0 or more aggregation
design objects associated
– They are what gets created when you run either of the
Aggregation Design wizards
– They detail which aggregations should be built
– Remember to assign designs to partitions!
• Each partition in a measure group can be associated
with 0 or 1 aggregation designs
• Aggregations are built on a per-partition basis
100. Aggregation design methodology
1. Ensure your dimension design is correct
2. Set all properties that affect aggregation design
3. Run the Aggregation Design Wizard to build some
initial aggregations
4. Perform Usage-Based Optimisation for at least a
few weeks, and repeat regularly throughout the
cube’s lifetime (or automate!)
5. Design aggregations manually for individual queries
when necessary
101. Dimension design
• Dimension design has a big impact on the
effectiveness of aggregations
• Your dimension designs should be stable before you
think about designing aggregations
• Three important things:
– Delete any attributes that won’t ever be used
– Check your attribute relationships to ensure they are
set optimally
– Build any natural user hierarchies you think might be
needed
102. Attribute relationships
• Attribute relationships allow SSAS to derive values at
higher granularities from aggregations built at lower
granularities
• The more ‘bushy’ the attribute relationships on your
dimension, the more effective aggregations will be
• Flat attribute relationships make designing
aggregations much harder
103. Properties to set
• Before starting aggregation design, you should set
the following properties:
– EstimatedRows property for partitions
– EstimatedCount property for dimensions
– AggregationUsage property for attributes of cube
dimensions
• Setting these correctly will ensure the aggregation
design wizards will do the best job possible
104. AggregationUsage
• The AggregationUsage property has the following
possible values:
– Full – meaning every aggregation will include this
attribute
– None – meaning no aggregation will include this
attribute
– Unrestricted – meaning an aggregation may include
this attribute
– Default – means the same as Unrestricted for the key
attribute and attributes used in natural user
hierarchies, and None for all other attributes
105. The Aggregation Design Wizard
• The Aggregation Design Wizard is a good way to
create a ‘first draft’ of your aggregation design
• Do not expect it to build every aggregation you’ll
ever need...
• ...it may not even produce any useful aggregations at
all!
• Increased cube complexity from AS2005 onwards
means it has a much harder job
106. The Aggregation Design Wizard
• The first few steps in the wizard ask you to set
properties we’ve already discussed
• However, the Partition Count property can only be
set in the wizard
• It specifies the number of members on an attribute
that are likely to have data in any given partition
• For example, if you partition by month, a partition
will only have data for one month
107. Set Aggregation Options step
• This is where the aggregations get designed!
• Warning: may take minutes, even hours to run
• Four options for working out which aggregations should be
built:
– Estimated Storage Reaches – build enough aggregations
to fill a certain amount of disk
– Performance Gain Reaches – build aggregations to get
x% of all possible performance gain from aggregations.
– I Click Stop – carry on until you get bored of waiting
– Do Not Design Aggregations
108. Set Aggregation Options step
• The strategy I use is:
– Choose I Click Stop, then see if the design process
finishes quickly and if so, how many aggregations are
built and what size
– If you have a reasonable size and number of
aggregations (build no more than 50 aggregations
here), click Finish
– Otherwise, click Stop and Reset and choose
Performance Gain Reaches and set it to 30%
– If only a few, small aggregations are designed,
increase by 10% and continue until you are happy
109. Usage-Based Optimisation
• Usage-Based Optimisation involves logging Query
Subcube requests and using that information to influence
aggregation design
• To set up logging:
– Open SQL Management Studio
– Right-click on your instance and select Properties
– Set up a connection to a SQL Server database using the
Query Log properties
– The QueryLogSampling property sets the % of requests to
log – be careful, as setting this too high might result in
your log table growing very large very quickly
110. Usage-Based Optimisation
• You should log for at least a few weeks to get a good
spread of queries
• Remember: any changes to your dimensions will
invalidate the contents of the log
• Next, you can run the Usage-Based Optimisation
wizard
• This is essentially the same as the Aggregation
Design wizard, but with log data taken into account
(you can also filter the data in the log)
111. Manual aggregation design
• You will find that sometimes the wizards do not build the
aggregations needed for specific queries
– Running the wizard is a bit hit-and-miss
– The wizards may never build certain aggregations, for
example they may think they are too large
• In this case you will need to design aggregations manually
• BIDS 2008 allows you to do this on the Aggregations tab of the
cube editor
• BIDS Helper (http://www.codeplex.com/bidshelper) has
better functionality for this
112. Manual aggregation design
• To design aggregations manually:
– Clear the cache
– Start a Profiler trace
– Run your query
– Look for Query Subcube events that take more than
500ms
– Build aggregations at the same granularity as the
Query Subcube events
– Save, deploy then run a ProcessIndex
113. Redundant attributes
• Often more than one attribute from the same
dimension will be used in a subcube request
• If those attributes are connected via attribute
relationships, only include the lowest one in the
aggregation
– Eg If Year, Quarter and Month are selected, just use
Month
• This will ensure that the aggregation can be used by
the widest range of queries
114. Influence of cube design
• There is a long list of features you might use in your
cube design that affect aggregation design and usage
• Mostly they result in queries executing at lower
levels of granularity than you’d expect
• This means aggregations are less likely to be hit, and
so less useful
• BIDS Helper and the BIDS Aggregations tab will help
you spot these situations
115. Many-to-many relationships
• Aggregations will never be used on an intermediate
measure group in a many-to-many relationship, if the
measure group is only used as an intermediate
measure group
• M2M queries are resolved at the granularity of the
key attributes of all dimensions common to the main
measure group and the intermediate measure group
• Therefore aggregations are less likely to be hit on the
main measure group
116. Semi-additive measures
• Queries involving semi-additive measures are always
resolved at the granularity attribute of the Time
dimension
• Therefore any aggregations must include the
granularity attribute of the Time dimension if they’re
to be used
117. Partitions
• It’s pointless to build aggregations above the
granularity of the attribute you’re using to slice your
partitions
– Eg, if you partition by Month, don’t build aggregations
above Month level
– Nothing stops you doing this though!
• Do this and you get no performance benefit and limit
the queries the aggregations can be used for
118. Parent-child hierarchies
• You cannot build aggregations that include parent-child hierarchies
• That also means that the levels that appear ‘within’
the parent-child structure cannot be used in an
aggregation
• You should build aggregations on the Key Attribute of
your dimension instead
– Very important if you have a DefaultMember or if
IsAggregatable=False
119. MDX Calculations
• Calculated members and MDX Script assignments are
very likely to request data for different granularities
from the one you’re querying
• Don’t assume – check what’s happening in Profiler!
120. Other things to watch for
• Measure expressions cause queries to be evaluated
at the common granularity of the two measure
groups involved
• Unary operators other than + and Custom Rollups
will also force query granularity down
• Do not include attributes from non-materialised
reference dimensions in aggregations as this may
result in incorrect values being returned
121. Monitoring Aggregation Usage
• Profiler can show when aggregations are used:
– Get Data From Aggregation event is fired
– Progress Report Begin/End events show reads from
aggregations
– ObjectPath column shows exactly which objects are
being read from
• XPerf utility can be used to determine how much
time is spent in IO and how much time is spent
aggregating data
123. Agenda
• Avoiding using MDX calculations
• Measure Expressions
• Bulk mode and Cell-by-Cell mode
• When Bulk Mode can be used
• Non Empty evaluation
• Using Profiler, Perfmon and MDX Studio to monitor FE
activity
• Using named sets effectively
• Using caching to avoid recalculating expressions
• Scoping calculations appropriately
124. Avoiding MDX
• Doing calculations in MDX should be a last resort - if
you can design something into the cube, then do so
• Examples:
– Using calculations to aggregate members instead of
creating a new attribute
– Using calculations instead of creating a m2m
relationship
125. Measure Expressions
• Measure Expressions are simple calculations that are
performed in the SE not the FE
– Set using a measure’s MeasureExpression property
– Can take advantage of parallelism with partitioning
– Faster than FE equivalent, though not always by much
• Take the form M1*M2 or M1/M2, where M1 and M2 are
measures from two different measure groups with at
least one common dimension
– Calculation takes place before measure aggregation
• Useful for:
– Currency conversion
– Percentage ownership/share
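• As a sketch, a currency-conversion MeasureExpression might simply be
the following (measure names are hypothetical, with [Average Rate]
coming from a separate exchange-rate measure group):
[Sales Amount] / [Average Rate]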
126. Bulk Mode MDX Functions
• Most MDX functions can use bulk mode
• Some, however, cannot – so using them can seriously
harm query performance
– See the topic “Performance Improvements for SQL
Server 2008 Analysis Services” in BOL for a list
– MDX Studio can identify them too
127. Other Ways to Force Cell-by-Cell
• Use cell security
• Use SSAS stored procedures/custom functions
• Use named sets or set aliases inside calculations
(before SSAS 2012)
• Reference calculations that are defined later on in
the MDX Script
• Use non-deterministic functions, eg NOW()
128. Non Empty Evaluation
• SSAS regularly needs to filter empty cells
– When queries use NON EMPTY or NonEmpty()
– As part of bulk mode evaluation
• It has two ways of doing this:
– Slow algorithm that evaluates all cell values and then
filters
• Can be very slow when calculations are involved
– Fast algorithm that ‘knows’ if a cell is empty by
looking at raw measure values from the SE
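• For illustration, both ways of requesting empty-cell filtering
(object names are hypothetical):
-- NON EMPTY on an axis removes rows that are empty across all columns
SELECT Measures.[Sales Amount] ON 0,
NON EMPTY [Customer].[Customer].[Customer].MEMBERS ON 1
FROM [Sales Cube]
-- NonEmpty() filters a set against a specific measure
SELECT Measures.[Sales Amount] ON 0,
NONEMPTY([Customer].[Customer].[Customer].MEMBERS,
Measures.[Sales Amount]) ON 1
FROM [Sales Cube]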
129. Non_Empty_Behavior
• The Non_Empty_Behavior property in SSAS 2000 and
2005 was a hint to use the fast algorithm
• Should not be used in SSAS 2008 – 98% of the time
the engine knows when to use it anyway
– Was probably the most misused feature in 2005,
leading to incorrect values coming from the cube
– In 2008 only use it if you really know what you’re
doing and as a last resort
130. Using Profiler To Monitor the FE
• Profiler gives us little useful information on what’s
happening in the FE
• Useful events include:
– Calculate Non Empty Begin/End
• When the value of IntegerData is 11, SSAS is using the
cell-by-cell algorithm
– Get Data From Cache – shows reads from cache, and
which type of cache is used
• But usually these events return too much
information to be useful...
131. Using Perfmon to Monitor the FE
• The following counters are useful:
– Total Calculation Covers – gives the number of
subcubes over which a common set of calculations is
evaluated. High is bad.
– Total Cells Calculated – the number of cell values
evaluated for a query. High is bad.
– Total Sonar Subcubes – the number of SE subcube
requests generated by Sonar
– Total NON EMPTY unoptimized / Total NON EMPTY for
calculated members – the number of times a Non Empty is
evaluated using the slower algorithms
132. Other Tools
• Kernrate Viewer can be used to identify FE problems
relating to stored procedures
– Shouldn’t be using stored procedures anyway
• MDX Script Performance Analyser can help identify
individual problem calculations
– Often, though, the problem is with a combination of
overlapping calculations
133. MDX Studio
• MDX Studio is a great tool for diagnosing FE
problems
– Combines Profiler and Perfmon data
– Identifies functions you shouldn’t be using, and links
to blog posts with more detail
– But unlikely to be developed further
134. MDX Don’ts
• Do not use the LookUpCube function
• Do not use the NonEmptyCrossjoin function, use NonEmpty
instead
• Do not use the StrToX functions inside calculated members
– If you must use them, use the Constrained flag
• Do not use non-deterministic functions inside calculated
members
– Eg Now(), UserName()
• Do not compare objects by name, use the IS operator instead
• Do not return anything other than a Null from an IIF when you
don’t want a calculation to take place
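• A minimal sketch combining the last two points – compare objects
with IS rather than by name, and return Null from the branch where no
calculation should take place (object names are hypothetical):
IIF([Date].[Calendar].CurrentMember IS [Date].[Calendar].[All Periods],
NULL,
Measures.[Sales Growth])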
135. MDX Avoids
• Avoid using subselects wherever possible
– In some cases, despite the impact on caching, it can
produce faster queries though
• Avoid using the LinkMember function
• Avoid using cell security
136. Using Named Sets
• Named sets can be used to avoid re-evaluating the
same set expression multiple times
– Eg finding the rank of a member in an ordered set,
where the set order remains static
• This can result in massive performance gains
• However, remember that using named sets also
disables bulk mode before 2012
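• A minimal sketch of the ranking scenario, with the ordered set built
once in a named set (object names are hypothetical):
WITH
SET OrderedCustomers AS
ORDER([Customer].[Customer].[Customer].MEMBERS,
Measures.[Sales Amount], BDESC)
MEMBER Measures.[Customer Rank] AS
RANK([Customer].[Customer].CurrentMember, OrderedCustomers)
SELECT Measures.[Customer Rank] ON 0,
[Customer].[Customer].[Customer].MEMBERS ON 1
FROM [Sales Cube]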
137. Rewriting to use Bulk Mode
• In some cases, rewriting calculations to avoid certain
functions can enable the use of bulk mode
– Eg Count(Filter()) scenario
– Somewhat non-intuitive, but works
– May get fixed in future service pack or version
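• A sketch of the Count(Filter()) rewrite (names are hypothetical) –
the Filter() form tends to run cell-by-cell, while the Sum/IIF form
is eligible for bulk mode:
-- Slower: Count(Filter())
COUNT(FILTER([Customer].[Customer].[Customer].MEMBERS,
Measures.[Sales Amount] > 1000))
-- Usually faster: Sum over IIF
SUM([Customer].[Customer].[Customer].MEMBERS,
IIF(Measures.[Sales Amount] > 1000, 1, NULL))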
138. Using Caching
• The FE can only cache cell values, not the value of
individual expressions
• So if you have an expensive expression that returns a
numeric value, then it could be worthwhile to put it
into its own calculated measure
– Can then be hidden
• After the first time it is evaluated, its value will be
cached, so any other calculations that use it can read
values from cache
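• A minimal MDX Script sketch of this pattern (expression and names
are hypothetical):
CREATE MEMBER CURRENTCUBE.Measures.[Avg Sales All Customers] AS
AVG([Customer].[Customer].[Customer].MEMBERS, Measures.[Sales Amount]),
VISIBLE = 0;
CREATE MEMBER CURRENTCUBE.Measures.[Sales vs Customer Average] AS
Measures.[Sales Amount] - Measures.[Avg Sales All Customers];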
140. Agenda
• How Analysis Services caching works
• When and why Analysis Services can’t cache data
• Warming the Storage Engine cache with the CREATE
CACHE statement
• Warming the Formula Engine cache by running
queries
• Automating cache warming
141. Types of Analysis Services cache
• Analysis Services can cache two types of value:
– Values returned by the Storage Engine
• ‘Raw’ measure values, one cache per measure group
• Dimension data, one cache per dimension
– Values returned by the Formula Engine
• Numeric values only – strings can’t be cached
• All SE caches have the same structure, known as the data
cache registry
• The FE can also store values in this structure if
calculations are evaluated in bulk mode
• The FE uses a different structure, the flat cache, for
calculations evaluated in cell-by-cell mode
142. Storage Engine Cache
• Data in the data cache registry is held in subcubes, ie
data at a common granularity
• Subcubes may not contain an entire granularity –
they may be filtered
• Prefetching is generally good for cache reuse
• Arbitrary-shaped subcubes cannot be cached
• Try to redesign hierarchies or queries to avoid this
143. Storage Engine Cache Aggregation
• SE cache data can be aggregated to answer queries in
some cases
• Except when the measure data itself cannot be
aggregated, for example with distinct count measures or
many-to-many
• Unlike aggregations, the SE cache isn’t aware of attribute
relationships
• Aggregation can only happen from the lower level of an
attribute hierarchy to the All Member
• For this reason user hierarchies can offer better
performance because of the subcubes they generate
144. Formula Engine Cache Scopes
• There are three different ‘scopes’ or lifetimes of a FE cache:
– Query – for calculations defined in the WITH clause of a query,
the FE values can only be cached for the lifetime of the query
– Session – for calculations defined for a session, using the
CREATE MEMBER statement executed on the client, FE values
can only be cached for the lifetime of a session
– Global – for calculations defined in the cube’s MDX Script, FE
values can be cached until either
• Any kind of cube processing takes place
• A ClearCache XMLA command is executed
• Writeback is committed
• Global scope is best from a performance point of view!
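• For illustration, the same calculation defined at each scope (cube
and measure names are hypothetical):
-- Query scope: WITH clause, cached only for this query
WITH MEMBER Measures.[Margin] AS
Measures.[Sales Amount] - Measures.[Total Cost]
SELECT Measures.[Margin] ON 0 FROM [Sales Cube]
-- Session scope: CREATE MEMBER executed from the client connection
CREATE MEMBER [Sales Cube].Measures.[Margin] AS
Measures.[Sales Amount] - Measures.[Total Cost];
-- Global scope: the same CREATE MEMBER placed in the cube's MDX Script
CREATE MEMBER CURRENTCUBE.Measures.[Margin] AS
Measures.[Sales Amount] - Measures.[Total Cost];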
145. Cache Sharing
• Values stored in the SE cache can always be shared
between all users
• Values stored in the FE cache can be shared between
users, except when:
– Stored in Query or Session-scoped caches
– Users belong to roles with different dimension security
permissions
• Note: dynamic security always prevents cache sharing
• Calculations evaluated in bulk mode cannot reference
values stored in the FE flat cache
• Calculations evaluated in cell-by-cell mode cannot
reference values stored in the FE data cache registry
146. Forcing Query-scoping
• In certain circumstances, SSAS uses query-scoped FE
caches when you would expect it to use global scope
• These are:
– Calculations that use the Username or LookupCube
functions
– Calculations that use non-deterministic functions such as Now()
or any SSAS stored procedures
– Queries that use subselects
– When any calculated member is defined in the WITH
clause, whether it is referenced or not in the query
– When cell security is used
147. Warming the SE cache
• Considerations for warming the SE cache:
– We want to avoid cache fragmentation, for example having
one unfiltered subcube cached rather than multiple
filtered subcubes
– It is possible to overfill the cache – the SE will stop looking
in the cache after it has searched 1000 subcubes
– We want to cache lower rather than higher granularities,
since the latter can be aggregated from the former in
memory
– We need a way of working out which granularities are
useful
148. Warming the SE cache
• We can warm the SE cache by using either:
– WITH CACHE, to warm the cache for a single query – not
very useful
– The CREATE CACHE command
• Remember that building aggregations is often a better
alternative to warming the SE cache
• But in some cases you can’t build aggregations – for example
when there are many-to-many relationships
149. CREATE CACHE
• Example CREATE CACHE statement:
CREATE CACHE
FOR [Adventure Works] AS
'({[Measures].[Internet Sales Amount]},
{[Date].[Date].[Date].MEMBERS},
{[Product].[Category].[Category].MEMBERS})'
150. Which subcubes should I cache?
• The Query Subcube and Query Subcube Verbose
events in Profiler show the subcubes requested from
the SE by the FE
• This is also the information stored in the SSAS query
log, stored in SQL Server
• Analyse this data manually and find the most
commonly-requested, lower-granularity subcubes
• Maybe also query the Query Log, or a Profiler trace
saved to SQL Server, to find other subcubes –
perhaps for queries that have been run recently
151. Warming the FE cache
• First, tune your calculations! Ensure use of bulk mode
where possible
• The only way to warm the FE cache is to run MDX queries
containing calculations
• Remember, these queries must not:
– Include a WITH clause
– Use subselects
• Also, no point trying to cache calculations whose values
cannot be cached
• And think about how security can impact cache usage
152. Queries to warm the FE Cache
• Again, it is worth manually constructing some MDX queries yourself to
warm the FE cache
• Also, running regularly-used queries (for example those used in SSRS
reports) can be a good idea
• Can easily collect the queries your users are running by running a
Profiler trace, then saving that trace to SQL Server or a .trc file
– The Query Begin and Query End events contain the MDX query
– Need to filter out those with a WITH clause etc
– Watch out for parameterisation (eg SSRS)
– Watch out for use of session sets and calculations (eg Excel 2003)
– Watch out for queries that slice by Time, where the actual slicer
used may change regularly
– Think about the impact of dimension security too
153. Memory considerations
• SSAS caching can use a lot of memory!
• The cache will keep growing until SSAS thinks it is running out
of memory:
– When memory usage exceeds the % of available system
memory specified in the LowMemoryLimit property, data will be
dropped from cache
– When it exceeds the % specified in the TotalMemoryLimit
property, all data will be dropped from cache
– We therefore don’t want to exceed the LowMemoryLimit
– We also want to avoid paging
– We need to leave space for caching real user queries
• The FE flat cache is limited to 10% of the TotalMemoryLimit
– If it grows bigger than that, it is completely emptied
154. Automating Cache Warming
• We should perform cache-warming after cube processing
has taken place
• Remember – it may take a long time! It should not
overlap/interfere with real users querying
• We can automate it a number of different ways:
– Running SSRS reports on a data-driven subscription
– Using the ascmd.exe utility
– Building your own SSIS package – the best solution for
overall flexibility.
• Either fetch queries from a SQL Server table
• Or from a Profiler .trc file using the Konesans Trace File
Source component
155. Summary
• Clearly a lot of problems to watch out for!
• However, some cache-warming (however inefficient) is
often better than none at all
• A perfectly-tuned cube would have little need for cache-warming, but...
– Some performance problems we just don’t know about
– Some we may not be able to fix (eg with complex
calculations, hardware limitations)
– Cache warming is likely to have some positive impact in
these cases – maybe lots, maybe not much
156. OUR SPONSORS AND PARTNERS
Organised by: Polskie Stowarzyszenie Użytkowników SQL Server (PLSSUG), the Polish SQL Server Users Group
Produced by: DATA MASTER Maciej Pilecki