Maintaining data historization is a very common but time-consuming task in a data warehouse environment. The common techniques involve outer joins and some kind of change detection. This change detection must handle NULL values correctly and is possibly the trickiest part. On the other hand, SQL offers standard functionality with exactly the desired behaviour: GROUP BY, or partitioning with analytic functions. Can it be used for this task?
2. Our company.
Trivadis DOAG17: SCD2 mal anders2 29.11.2018
Trivadis is a leading provider of IT consulting, system integration, solution engineering, and IT services, focusing on … and … technologies in Switzerland, Germany, Austria, and Denmark. Trivadis delivers its services in the following strategic business areas:
Trivadis Services takes over the corresponding operation of your IT systems.
OPERATIONS
3. With over 600 IT and domain experts at your site.
Locations: Copenhagen, Munich, Lausanne, Bern, Zurich, Brugg, Geneva, Hamburg, Düsseldorf, Frankfurt, Stuttgart, Freiburg, Basel, Vienna
14 Trivadis branch offices with more than 600 employees.
More than 200 Service Level Agreements.
More than 4,000 training participants.
Research and development budget: CHF 5.0 million / EUR 4.0 million.
Financially independent and sustainably profitable.
Experience from more than 1,900 projects per year at over 800 customers.
4. About me
Senior Consultant at Trivadis GmbH, Düsseldorf
Focus on Oracle
– Data Warehousing
– Application Development
– Application Performance
Course instructor for „Oracle 12c New Features für Entwickler“ and „TechnoCircle Oracle 12c Release 2“
Blog: http://blog.sqlora.com
5. Agenda
1. Introduction and state of the art
2. The „new“ approach
3. Use cases and performance
4. Conclusion
7. Introduction
Historization as part of the loading process in a data warehouse
We consider Slowly Changing Dimensions Type II
All changes are completely tracked. A change in at least one of the tracked columns triggers the creation of a new version record
The most challenging task is the change detection
DWH_KEY | VALID_FROM | VALID_TO   | CUR_VERSION | ETL_OP | BUS_KEY | FIRST_NAME | SECOND_NAMES | LAST_NAME | HIRE_DATE  | FIRE_DATE  | SALARY
1       | 01.12.2016 | 02.12.2016 | N           | UPD    | 123     | Roger      |              | Federer   | 01.01.2010 |            | 900000
11      | 03.12.2016 |            | Y           | INS    | 123     | Roger      |              | Federer   | 01.01.2010 |            | 920000
6       | 02.12.2016 | 02.12.2016 | N           | UPD    | 345     | Venus      |              | Williams  | 01.11.2016 |            | 500000
10      | 03.12.2016 |            | Y           | INS    | 345     | Venus      |              | Williams  | 01.11.2016 | 01.12.2016 | 500000
2       | 01.12.2016 | 02.12.2016 | N           | UPD    | 456     | Rafael     |              | Nadal     | 01.05.2009 |            | 720000
3       | 01.12.2016 | 01.12.2016 | N           | UPD    | 789     | Serena     |              | Williams  | 01.06.2008 |            | 650000
5       | 02.12.2016 |            | Y           | INS    | 789     | Serena     | Jameka       | Williams  | 01.06.2008 |            | 650000
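To make the versioning in the table above concrete, here is a minimal sketch of one SCD2 step. It is not the author's code: it uses Python's built-in sqlite3 module as a stand-in for an Oracle warehouse, a reduced column set, ISO date strings, and a hypothetical helper named apply_change. As in the table, the old version is closed on the day before the new version becomes valid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_employee (
        dwh_key     INTEGER PRIMARY KEY,
        valid_from  TEXT,
        valid_to    TEXT,     -- NULL while the version is current
        cur_version TEXT,     -- 'Y' or 'N'
        bus_key     INTEGER,
        salary      INTEGER
    )
""")
conn.execute(
    "INSERT INTO dim_employee VALUES (1, '2016-12-01', NULL, 'Y', 123, 900000)"
)


def apply_change(conn, bus_key, new_salary, load_date, last_valid_day):
    """Hypothetical helper: close the current version and insert a new one."""
    current = conn.execute(
        "SELECT salary FROM dim_employee "
        "WHERE bus_key = ? AND cur_version = 'Y'",
        (bus_key,),
    ).fetchone()
    if current is not None and current[0] == new_salary:
        return False  # tracked column unchanged: no new version
    # Close the old version ...
    conn.execute(
        "UPDATE dim_employee SET valid_to = ?, cur_version = 'N' "
        "WHERE bus_key = ? AND cur_version = 'Y'",
        (last_valid_day, bus_key),
    )
    # ... and open the new one.
    conn.execute(
        "INSERT INTO dim_employee (valid_from, cur_version, bus_key, salary) "
        "VALUES (?, 'Y', ?, ?)",
        (load_date, bus_key, new_salary),
    )
    return True


changed = apply_change(conn, 123, 920000, "2016-12-03", "2016-12-02")
versions = conn.execute(
    "SELECT valid_from, valid_to, cur_version, salary "
    "FROM dim_employee WHERE bus_key = 123 ORDER BY valid_from"
).fetchall()
```

The sketch tracks a single column only; the hard part discussed in the rest of the talk is doing this comparison for many columns, including NULL values.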
8. State of the Art
Typical OWB mapping
9. State of the Art
Data flow of the legacy approach: Source and Target are combined with a full outer join, change detection is applied to the join result, the result is split into old versions and new versions, these are recombined with UNION ALL, and the result is merged (MERGE) into the Target. In the join result, source columns carry the suffix _S and target columns the suffix _T.
Source:
BK C1 C2
11 A  B
22 D  E
44 K  L
77 M
Target:
BK C1 C2
11 A  BB
22 D  E
33 F  G
77 M  N
Change detection?
NVL(C2_S,'(NULL)') != NVL(C2_T,'(NULL)')
LNNVL(C2_S = C2_T) AND NVL(C2_S, C2_T) IS NOT NULL
DECODE, STANDARD_HASH, SYS_OP_MAP_NONNULL …
More on delta detection: https://danischnider.wordpress.com/2016/10/08/delta-detection-in-oracle-sql/
Data to the left of the split has to be accessed twice!
10. State of the Art
Change detection must handle NULL values correctly
Comparing each and every column in a complex way
Or maintaining and comparing hash diffs: common rules are needed, and re-hashing is sometimes needed after structural changes
A full outer join may be expensive when not working with „deltas“
Splitting the join result into two data sets causes the join to be executed twice
Another solution?
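Why the comparison has to be NULL-aware can be shown with a small sketch. It is an illustration under assumptions, not the original demo: SQLite (via Python's sqlite3) stands in for Oracle, COALESCE is used as the portable counterpart of the NVL variant, the table names src and tgt are made up, and an inner join on the business key is enough for this purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (bk INTEGER, c1 TEXT, c2 TEXT);
    CREATE TABLE tgt (bk INTEGER, c1 TEXT, c2 TEXT);
    INSERT INTO src VALUES (11,'A','B'),  (22,'D','E'), (44,'K','L'), (77,'M',NULL);
    INSERT INTO tgt VALUES (11,'A','BB'), (22,'D','E'), (33,'F','G'), (77,'M','N');
""")

# Naive comparison: NULL != 'N' evaluates to NULL, so BK 77 is silently lost.
naive = conn.execute("""
    SELECT s.bk
    FROM src s JOIN tgt t ON t.bk = s.bk
    WHERE s.c2 != t.c2
    ORDER BY s.bk
""").fetchall()

# NULL-aware comparison: map NULL to a placeholder before comparing.
# Assumes the placeholder '(NULL)' never occurs as a real value.
null_aware = conn.execute("""
    SELECT s.bk
    FROM src s JOIN tgt t ON t.bk = s.bk
    WHERE COALESCE(s.c2, '(NULL)') != COALESCE(t.c2, '(NULL)')
    ORDER BY s.bk
""").fetchall()
```

With the sample data, naive finds only BK 11, while null_aware finds BK 11 and BK 77. An expression like this has to be repeated for every tracked column, which is the complexity the bullet points refer to.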
12. The „new“ approach
The „new“ approach is not really new
Often used for ad hoc queries
Are these two records different?
BK C1 C2 C3 C4 … … C467 C468 C469
11 A B C D … … AA BB CC
11 A B C D … … AB BB CC
Using GROUP BY:
SELECT COUNT(*)
FROM t
GROUP BY BK, C1, C2, C3, C4, … C467, C468, C469
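The GROUP BY check can be tried out with a handful of columns. This is a minimal sketch, assuming SQLite via Python's sqlite3 as a stand-in database and a hypothetical helper count_groups; the query itself is the one from the slide, shortened to four columns.

```python
import sqlite3


def count_groups(rows):
    """Return the group sizes after grouping by all columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (bk INTEGER, c1 TEXT, c2 TEXT, c3 TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    return [
        row[0]
        for row in conn.execute(
            "SELECT COUNT(*) FROM t GROUP BY bk, c1, c2, c3 ORDER BY COUNT(*)"
        )
    ]


# Two identical records collapse into one group of size 2.
same = count_groups([(11, "A", "B", "AA"), (11, "A", "B", "AA")])

# A difference in any column yields two groups of size 1.
different = count_groups([(11, "A", "B", "AA"), (11, "A", "B", "AB")])
```

Here same is [2] and different is [1, 1], regardless of which column differs.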
13. The „new“ approach
Or using an analytic function:
SELECT COUNT(*) OVER (PARTITION BY BK, C1, C2, C3, … C468, C469)
FROM t;
If the count equals 2, the records are the same
If the count equals 1, they are different
But what about NULLs?
For GROUP BY and PARTITION BY: NULL = NULL, VALUE != NULL
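The NULL behaviour of PARTITION BY can be checked with a short sketch. It assumes SQLite 3.25 or later (the first version with window functions) via Python's sqlite3 as a stand-in for Oracle; the id column and the sample rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER, bk INTEGER, c1 TEXT, c2 TEXT);
    INSERT INTO t VALUES
        (1, 77, 'M', NULL),   -- same as row 2, including the NULL
        (2, 77, 'M', NULL),
        (3, 88, 'X', NULL),   -- differs from row 4: NULL vs. value
        (4, 88, 'X', 'Y');
""")

# Every row keeps its identity; cnt is the size of its partition.
counts = conn.execute("""
    SELECT id, COUNT(*) OVER (PARTITION BY bk, c1, c2) AS cnt
    FROM t
    ORDER BY id
""").fetchall()
```

Rows 1 and 2 end up in the same partition (count 2) because two NULLs are treated as equal, while rows 3 and 4 each get a count of 1. No NVL-style wrapping is needed.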
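14. The „new“ approach
Data flow: Source and Target are combined with UNION ALL, each row tagged with its origin in S_T (S = source, T = target); the result is grouped with GROUP BY over all columns and then merged (MERGE) into the Target.
Source:
BK C1 C2
11 A  B
22 D  E
44 K  L
77 M
Target:
BK C1 C2
11 A  BB
22 D  E
33 F  G
77 M  N
After UNION ALL:
S_T BK C1 C2
S   11 A  B
S   22 D  E
S   44 K  L
S   77 M
T   11 A  BB
T   22 D  E
T   33 F  G
T   77 M  N
After GROUP BY:
MIN(S_T) CNT BK C1 C2
S        1   11 A  B
S        2   22 D  E
S        1   44 K  L
S        1   77 M
T        1   11 A  BB
T        1   33 F  G
T        1   77 M  N
Target rows with CNT = 1:
BK C1 C2
11 A  BB
33 F  G
77 M  N
DEMO!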
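The whole UNION ALL plus GROUP BY step can be reproduced with the sample data of this slide. The following is a sketch under assumptions, not the demo shown in the talk: SQLite via Python's sqlite3 replaces Oracle, the table names src and tgt are made up, and the final MERGE into the target is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (bk INTEGER, c1 TEXT, c2 TEXT);
    CREATE TABLE tgt (bk INTEGER, c1 TEXT, c2 TEXT);
    INSERT INTO src VALUES (11,'A','B'),  (22,'D','E'), (44,'K','L'), (77,'M',NULL);
    INSERT INTO tgt VALUES (11,'A','BB'), (22,'D','E'), (33,'F','G'), (77,'M','N');
""")

# Tag each row with its origin, stack both sets, group by all columns.
groups = conn.execute("""
    SELECT MIN(s_t) AS origin, COUNT(*) AS cnt, bk, c1, c2
    FROM (
        SELECT 'S' AS s_t, bk, c1, c2 FROM src
        UNION ALL
        SELECT 'T' AS s_t, bk, c1, c2 FROM tgt
    )
    GROUP BY bk, c1, c2
    ORDER BY origin, bk
""").fetchall()

# CNT = 2: the row exists identically on both sides, nothing to do.
# CNT = 1: the row exists on one side only, i.e. it is new, changed or gone.
changes = [g for g in groups if g[1] == 1]
source_only = [g[2] for g in changes if g[0] == "S"]
target_only = [g[2] for g in changes if g[0] == "T"]
```

With this data, source_only contains the business keys 11, 44 and 77 and target_only contains 11, 33 and 77. How these rows are turned into new and closed versions in the MERGE is not shown here.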
17. Use Cases and Performance
Full Data Load
Legacy: Source (full data) is joined with the current versions in the Target (which holds older versions and current versions); the JOIN may be slow, the filter may be slow. Partitioning?
New: full data and current versions are combined with UNION ALL; the GROUP BY may be slow.
18. Use Cases and Performance
Delta Load
Legacy: Source (delta) is joined with the current versions in the Target (which holds older versions and current versions); JOIN, filter (may be slow). Partitioning?
New: delta and current versions are combined with UNION ALL; the GROUP BY may be slow.
19. Use Cases and Performance
Delta Load with pre-filter
Legacy: Source (delta) is joined with the current versions in the Target (which holds older versions and current versions); JOIN, filter (Business_key IN …).
New: delta and current versions (filtered) are combined with UNION ALL; the GROUP BY is fast.
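One possible reading of the pre-filter is sketched below: only those current versions whose business key occurs in the delta take part in the UNION ALL and GROUP BY. This is an assumption about the diagram, not code from the talk; SQLite via Python's sqlite3 stands in for Oracle, and the table names delta and tgt_current are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE delta       (bk INTEGER, c1 TEXT, c2 TEXT);
    CREATE TABLE tgt_current (bk INTEGER, c1 TEXT, c2 TEXT);
    INSERT INTO delta VALUES (11,'A','B'), (44,'K','L');
    INSERT INTO tgt_current VALUES
        (11,'A','BB'), (22,'D','E'), (33,'F','G'), (77,'M','N');
""")

# The WHERE clause belongs to the second branch only: target rows whose
# business key is not part of the delta never reach the GROUP BY.
rows = conn.execute("""
    SELECT MIN(s_t) AS origin, COUNT(*) AS cnt, bk, c1, c2
    FROM (
        SELECT 'S' AS s_t, bk, c1, c2 FROM delta
        UNION ALL
        SELECT 'T' AS s_t, bk, c1, c2 FROM tgt_current
        WHERE bk IN (SELECT bk FROM delta)
    )
    GROUP BY bk, c1, c2
    HAVING COUNT(*) = 1
    ORDER BY origin, bk
""").fetchall()
```

Only three rows remain: the two delta rows and the outdated target version of BK 11. Without the pre-filter, every current version missing from the delta would show up as a group of size 1, so the filter is needed for correctness in a delta load as much as for speed.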
20. Use Cases and Performance
Data warehouse with Siebel CRM as a source
Order table S_ORDER: „only“ 120 columns
Comparing the legacy approach vs. GROUP BY vs. analytic functions
Full staging table as a source vs. delta (with or without pre-filtering)
Approx. 6 million rows in the target table
Approx. 3 million rows in the full load dataset
Approx. 3,000 rows in the delta load dataset
21. Use Cases and Performance
Method                       | Delta Load, min | Full Load, min
Outer Join (legacy approach) | 0:09            | 0:41
GROUP BY                     | 1:10            | 1:04
GROUP BY with pre-filter     | 0:04            | N/A
Analytic Function            | 2:12            | 4:52
Analytic with pre-filter     | 0:12            | N/A
23. Use Cases and Performance
Loading Dimensions from Core
Legacy: the current versions from Core (Source) are joined with the current versions of the dimension (Target, which also holds older versions); the JOIN may be slow, the filter may be slow. Partitioning?
New: the current versions from Core and the current versions of the dimension are combined with UNION ALL; the GROUP BY may be slow.
24. Use Cases and Performance
Loading Dimensions from Core
Source is a View
Legacy: the view is joined with the current versions in the Target (which holds older versions and current versions); the JOIN may be slow, the filter may be slow. Partitioning?
New: the full data and the current versions are combined with UNION ALL; the GROUP BY may be slow.
25. Use Cases and Performance
Loading of a dimension via view
The view joins some „big“ tables (50 GB, more than 40 million rows)
And produces fewer than 500 dimension records per day
The loading time could be reduced by about 45 percent (3 min 50 sec → 2 min)
26. Conclusion
It is simpler and faster in certain cases
The source is queried only once, which can be significant if the source is a view
The code can easily be generated
Simple to build even without generation (only a plain list of columns to copy and paste)
It is worth running an ad hoc test with your own data
Test it!
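The point about generation can be illustrated with a few lines: since the statement needs nothing but a plain column list, a generator is little more than a string template. The function build_change_query and its parameters are hypothetical names for this sketch, and the identifiers are inserted without any validation, so they should come from trusted metadata only.

```python
def build_change_query(source, target, columns):
    """Build the UNION ALL / GROUP BY comparison from a plain column list."""
    col_list = ", ".join(columns)
    return (
        f"SELECT MIN(s_t) AS origin, COUNT(*) AS cnt, {col_list}\n"
        f"FROM (\n"
        f"    SELECT 'S' AS s_t, {col_list} FROM {source}\n"
        f"    UNION ALL\n"
        f"    SELECT 'T' AS s_t, {col_list} FROM {target}\n"
        f")\n"
        f"GROUP BY {col_list}\n"
        f"HAVING COUNT(*) = 1"
    )


# Table and column names are placeholders for illustration.
sql = build_change_query("src", "tgt", ["bk", "c1", "c2"])
print(sql)
```

Adding or dropping a tracked column changes only the list passed in; no per-column comparison logic has to be touched.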
28. Trivadis @ DOAG 2017
#opencompany
Booth: 3rd floor, right next to the escalator
We share our know-how!
Just drop by: live presentations and document archive
T-shirts, a prize draw, and more
We look forward to your visit