Are OLAP cubes "large monsters" that deliver quick data retrieval at the expense of long load times? This presentation shows one way to kill this myth.
1. Case study:
Quasi real-time OLAP cubes
by Ziemowit Jankowski
Database Architect
2. OLAP Cubes - what are they?
• Used to quickly analyze and retrieve data from different
perspectives
• Numeric data
• Structured data:
o can be represented as numeric values (or sets
thereof) accessed by a composite key
o each of the parts of the composite key belongs to a
well-defined set of values
• Facts = numeric values
• Dimensions = parts of the composite key
• Source = usually a star or snowflake schema in a relational DB (other sources possible)
3. OLAP Cubes - data sources
(diagrams: star schema vs. snowflake schema)
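A minimal sketch of what such a star-schema source could look like, with hypothetical table and column names (Oracle-flavoured SQL; the fact table holds the measures, addressed by a composite key of dimension foreign keys):

  -- Hypothetical star schema: one fact table plus dimension tables.
  CREATE TABLE dim_date    (date_key    INTEGER PRIMARY KEY, calendar_date DATE,
                            calendar_year INTEGER, calendar_month INTEGER);
  CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name VARCHAR2(100),
                            product_group VARCHAR2(100));
  CREATE TABLE dim_site    (site_key    INTEGER PRIMARY KEY, site_name VARCHAR2(100));

  -- Facts: measures addressed by the composite key (date_key, product_key, site_key).
  CREATE TABLE fact_production (
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    site_key     INTEGER NOT NULL REFERENCES dim_site(site_key),
    produced_qty NUMBER,   -- outcome measure
    forecast_qty NUMBER,   -- forecast measure
    PRIMARY KEY (date_key, product_key, site_key)
  );
  -- In a snowflake schema, dim_product would instead reference a normalized
  -- product-group table rather than carrying product_group inline.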
4. OLAP Facts and dimensions
• Every "cell" in an OLAP cube
contains numeric data a.k.a
"measures".
• Every "cell" may contain more than
one measure, e.g. forecast and
outcome.
• Every "cell" has a unique
combination of dimension values.
5. OLAP Cubes - operations
• Slice = choose the values corresponding to ONE value on one or more dimensions
• Dice = choose the values corresponding to one slice, or to a number of consecutive slices, on two or more dimensions of the cube (examples below)
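As plain SQL against the hypothetical fact_production table from the earlier sketch, the two operations roughly correspond to the queries below (key values are made up; in the cube itself these would be MDX queries, SQL is used here only to illustrate the idea):

  -- Slice: fix ONE value on a dimension (here a single date).
  SELECT product_key, site_key, SUM(produced_qty) AS produced_qty
  FROM   fact_production
  WHERE  date_key = 20090502
  GROUP BY product_key, site_key;

  -- Dice: restrict consecutive values on two or more dimensions at once,
  -- yielding a smaller sub-cube.
  SELECT date_key, product_key, SUM(produced_qty) AS produced_qty
  FROM   fact_production
  WHERE  date_key BETWEEN 20090501 AND 20090531
  AND    product_key IN (10, 11, 12)
  GROUP BY date_key, product_key;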
6. OLAP Cubes - operations (cont'd)
• Drill down/up = choose lower/higher
level details. Used in context of
hierarchical dimensions.
• Pivot = rotate the orientation of the
data for reporting purposes
• Roll-up = summarize data by aggregating along a dimension hierarchy (see the sketch below)
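A rough SQL illustration of rolling day-level facts up to month level along the date hierarchy, again using the hypothetical tables from the star-schema sketch:

  -- Roll-up: aggregate day-level facts to month level along the
  -- date hierarchy (day -> month -> year).
  SELECT d.calendar_year, d.calendar_month,
         SUM(f.produced_qty) AS produced_qty
  FROM   fact_production f
  JOIN   dim_date d ON d.date_key = f.date_key
  GROUP BY d.calendar_year, d.calendar_month;
  -- Drill-down is the inverse: group by d.calendar_date again to get
  -- back to the daily detail.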
7. OLAP Cubes - refresh methods
• Incremental:
o possible when cubes grow
"outwards", i.e. no "scattered"
changes in data
o only delta data need to be read
o refresh may be fast if delta is small
• Full:
o possible for all cubes, even when changes are "scattered" all over the data
o all data need to be re-read with every refresh
o a refresh may take a long time (hours)
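In terms of the source queries, the difference can be sketched like this (load_ts and the watermark parameter are hypothetical; they stand for whatever marks the "growing edge" of the data):

  -- Full refresh: the partition query re-reads the whole data set.
  SELECT * FROM fact_production;

  -- Incremental refresh: only the delta is read. This works only when the
  -- cube grows "outwards", i.e. old rows are never changed.
  SELECT * FROM fact_production
  WHERE  load_ts > :last_processed_ts;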
8. The situation on hand
• Business operating on 24*6 basis (Sun-Fri)
• Events from production systems are aggregated into flows
and production units
• Production figures may be adjusted manually long after
production date
• Daily production figures are the basis for daily forecasts, with the simplified formula (see the sketch after this list):
forecast(yearX) = production(yearX-1) * trend(yearX) + manualFcastAdjustm
• Adjustments in production figures will alter forecast
figures
• Outcome and forecast should be stored in MS OLAP cubes, as required by the software architecture
• The system should simplify comparisons between forecast
and outcome figures
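A sketch of the simplified forecast formula in SQL, with hypothetical table and column names (production_by_year, yearly_trend, manual_forecast_adjustment; yearly granularity is used here only for brevity). Note how a later correction of the production figures automatically changes the forecast, which is why the cubes have to be refreshed continuously:

  -- forecast(yearX) = production(yearX-1) * trend(yearX) + manual adjustment
  SELECT p.product_key,
         t.calendar_year AS forecast_year,
         p.produced_qty * t.trend_factor
           + COALESCE(a.manual_adjustment, 0) AS forecast_qty
  FROM   production_by_year p                          -- production for yearX-1
  JOIN   yearly_trend t
         ON t.calendar_year = p.calendar_year + 1      -- trend for yearX
  LEFT JOIN manual_forecast_adjustment a
         ON a.product_key   = p.product_key
        AND a.calendar_year = t.calendar_year;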
9. Software
• Source of data:
o Relational database
o Oracle 10g database
o extensive use of PL/SQL in database
• Destination of data:
o OLAP cubes - MS SQL Server Analysis Services (version
2005 and 2008)
• Other software:
o MS SQL Server database
10. QUESTION
Can we get almost real-time reports from MS OLAP cubes?
ANSWER
YES! The answer lies in "cube partitioning".
11. Cube partitioning - the basics
• Cube partitions may be updated independently
• Cube partitions must not overlap (overlapping partitions would produce duplicate values)
• Time is a good dimension to partition on
12. MS OLAP cube partitioning - details
• Every cube partition has its own query to define the data
set fetched from the data source
• The SQL statements define the non-overlapping data sets
14. How to partition? - theory
• Partitions with different lengths and different update
frequencies:
o current data = very small partition, very short update
times, updated often
o "not very current" data = a bit larger partition, longer
update times, updated less often
o historical data = large partition, long update times,
updated seldom
• The 24*6 operation provides the "seldom" update window (see the partition sketch below)
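Putting slides 12 and 14 together, the per-partition SQL statements could look roughly like this (production_date and the boundary dates are hypothetical; the >= / < pattern keeps the partitions non-overlapping):

  -- "Now" partition: very small, re-processed every few minutes.
  SELECT * FROM fact_production
  WHERE  production_date >= DATE '2009-05-02';

  -- "Recent" partition: larger, processed less often.
  SELECT * FROM fact_production
  WHERE  production_date >= DATE '2009-01-03'
  AND    production_date <  DATE '2009-05-02';

  -- "History" partition: large, processed seldom (e.g. once a week).
  SELECT * FROM fact_production
  WHERE  production_date <  DATE '2009-01-03';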
15. How to partition? - theory cont'd
• One cube for both forecast and outcome
16. Solution - approach one
Decisions:
• Cubes partitioned on date boundaries
• MOLAP cubes (for better query performance)
• Use SSIS to populate cubes
o dimensions populated by incremental processing
o facts populated by full processing
o jobs for historical data must be run after midnight to
compensate for date change
Actions:
• Cubes built
• SSIS packages deployed inside SQL Server (rather than on the filesystem)
• SSIS set up as scheduled database jobs
17. Did it work?
No!
Malfunctions:
• Simultaneous updates of cube partitions could lead to
deadlocks
• Deadlocks left cube partitions in an unprocessed state
Amendment:
• Cube partitions must not be updated simultaneously
18. Solution - approach two
Decisions:
• Cube processing must be ONE partition at a time
• Scheduling done by an SSIS "super package" (sketched below):
o a SQL Server table contains approximate run frequencies and package names
o the "super package" executes SSIS packages as indicated by the table
Actions:
• Scheduling table created
• "Super package" created to be self-modifying
19. Did it work?
Not really!
Malfunctions:
• Historical data had to be updated after midnight, and real-time updates of the "Now" partition were postponed meanwhile. This was done to avoid "gaps" in outcome data and "overlaps" in forecast data.
• Real-time updates ended soon after midnight and were
resumed a few hours later. (That was NOT acceptable.)
Amendment:
• Re-think!
20. Solution - approach three
Decisions:
• Take advantage of 6*24 cycle (as opposed to 7*24)
• Switch dates on Saturdays only
o the "Now" partition had to stretch from Saturday to
Saturday
o all other partitions had to stretch from a Saturday to
another Saturday
• Re-process all time-consuming partitions on Saturday after
switch of date
21. Solution - approach three cont'd
Actions:
• Create logic in the Oracle database to do date calculations "modulo week", i.e. based on Saturdays. The logic is implemented as a function (a sketch follows after this list).
• Rewrite SQL statements for cube partitions so that they
employ the Oracle function (as above) instead of current
date +/- given number of days.
• Reschedule the time-consuming updates so that they run every 7th day.
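A minimal sketch of such a "modulo week" helper in Oracle PL/SQL (the function name is hypothetical): it returns the most recent Saturday on or before a given date, so every partition boundary can be expressed as whole weeks relative to last_saturday(SYSDATE) and only "moves" once a week:

  -- Hypothetical helper: most recent Saturday on or before p_date.
  CREATE OR REPLACE FUNCTION last_saturday(p_date IN DATE DEFAULT SYSDATE)
    RETURN DATE
  IS
  BEGIN
    -- TRUNC(d, 'IW') is the ISO-week Monday; shifting by two days turns it
    -- into a Saturday boundary, independent of NLS day-name settings.
    RETURN TRUNC(p_date + 2, 'IW') - 2;
  END last_saturday;
  /

The partition SQL statements can then use, for example, production_date >= last_saturday(SYSDATE) instead of current-date-plus-or-minus-days arithmetic.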
23. Lessons learned
• It is possible to build real-time OLAP cubes in MS
technology
• It is possible to make the partitions self-maintaining in
terms of partition boundaries
• The concept needs careful engineering, as there are pitfalls along the way.
24. Omitted details
Some details have been omitted:
• the quasi real-time updates are scheduled to occur every
2nd or 3rd minute
• scheduling is not exact: the Super-job keeps track of what is to be run and when, and executes SSIS packages based on their "scheduled-to-run" state, their priority, and a few other criteria
• the source of data is not a proper star schema; rather, it is an emulation of facts and dimensions by means of data tables and views in Oracle.