The document discusses how Data Vault can be implemented efficiently on SAP Hana: thanks to Hana's columnar architecture a single broad satellite per hub works well, splitting satellites by rate-of-change is equally efficient for storage, and multiple satellites are preferable when data comes from multiple sources (write efficiency). It also recommends creating one point-in-time (PIT) table per hub as a physical table rather than a SQL view, to allow efficient referential joins in Hana.
2. Our Dot on the Horizon
- Central point for delivering healthcare process data for medical research
- Integrate various sources
- Historize, trace and pseudonymize all data used
3. Our Journey
- Learning and adapting to Data Vault: not everybody is a modeler (Shu Ha Ri)
- Script, code, build, try, test, throw away and start again
- Testing overrated?
- Architecture improvements:
  - Performance issues with SAS/Microsoft
  - Performance issues with loading scripts
  - Automate the DV load
- From Chaos to SCRUM
4. Our Obstacles
- Registration for the healthcare process vs. usability for research
- Questionnaires: sources or generic models?
- Performance:
  - Do we really need all complete texts?
  - Do we really need 20 years of lab results?
- The usual: conflicting interests, politics, etc.
5. Our preliminary results
- 2013: selection of 5 major studies as starting showcases proved difficult
- 2014: had to choose 5 new showcases from 25 applicants
- Started as a Research Data Platform, now growing towards an Enterprise Data Platform (including Education and BI)
- Architecture now stable
6. Lessons learned
• Automate when possible
• Invest in a team of skilled pioneers
• Models rule everything
• Adopt agility, teach agility
14. Groups of Links: context at hospital
Imagine the following:
• An operation (surgery) is executed by a group of people (first surgeon, second surgeon, assistant, anesthesiologist, etc.)
• An operation is planned a couple of weeks in advance
• Whenever the planning changes in the source, the complete group is sent to the EDW
15. Group of Links: the Data
Time  operation_no  employee_no  role
T=1   19354         John         OP1
      19354         Jane         OP2
      19354         Chris        ANA
T=2   19354         John         OP1
      19354         Mary         ANA
T=3   19354         Jane         OP1
      19354         Chris        ANA

Please note: the actual operation with operation_no 19354 is executed by Jane (OP1) and Chris (ANA).
16. Groups of Links: the Problem
Standard Data Vault loading routines cannot handle this situation:

operation_no  employee_no  role  load_dts
19354         John         OP1   T=1
19354         Jane         OP2   T=1
19354         Chris        ANA   T=1
19354         Mary         ANA   T=2
19354         Jane         OP1   T=3
17. Groups of Links: the Problem
Using end-dating of a link (preferably a validity satellite) cannot handle this problem either:

operation_no  employee_no  role  load_dts  Active?
19354         John         OP1   T=1       No (T=3)
19354         Jane         OP2   T=1       Yes
19354         Chris        ANA   T=1       No (T=3)
19354         Mary         ANA   T=2       Yes
19354         Jane         OP1   T=3       Yes

BK of link used: operation_no + role
18. Groups of Links: our solution
1. Add a validity satellite to the link (for end-dating)
2. Tell the metadata of the automation tool that this is a group validity satellite with BK = operation_no
3. Whenever an existing operation_no is present in the staging layer, set all its current links to Active=No
4. Process as usual
• Remark: because the same row can come back (e.g. John/OP1), it will be set to Active=No and Active=Yes at the same time; therefore there can be no unique index on the BK of the validity satellite, and some cleanup is required after loading
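The steps above can be sketched in code. Below is a minimal Python simulation of the group-validity load, assuming an in-memory list stands in for the validity satellite; the cleanup strategy shown (dropping the spurious deactivate/re-insert pair for rows that come back unchanged, and re-activating the old row) is one possible interpretation of the "cleaning up" remark, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SatRow:
    operation_no: int
    employee_no: str
    role: str
    load_dts: int
    active: bool

def load_group(sat, staging, load_dts):
    """Load one staging batch into the group validity satellite."""
    # Step 3: an operation_no present in staging ends ALL its current links.
    groups = {op for op, _, _ in staging}
    deactivated = set()
    for r in sat:
        if r.active and r.operation_no in groups:
            r.active = False
            deactivated.add((r.operation_no, r.employee_no, r.role))
    # Step 4: process as usual - insert the delivered group as active.
    for op, emp, role in staging:
        sat.append(SatRow(op, emp, role, load_dts, True))
    # Cleanup: the same row can come back (e.g. John/OP1), leaving it both
    # Active=No and Active=Yes at this load; drop the new duplicate and
    # re-activate the most recent old row, so no spurious change is recorded.
    for key in deactivated & {(op, emp, role) for op, emp, role in staging}:
        sat[:] = [r for r in sat if not (
            r.active and r.load_dts == load_dts
            and (r.operation_no, r.employee_no, r.role) == key)]
        old = max((r for r in sat
                   if (r.operation_no, r.employee_no, r.role) == key),
                  key=lambda r: r.load_dts)
        old.active = True

# The data from slide 15:
sat = []
load_group(sat, [(19354, "John", "OP1"), (19354, "Jane", "OP2"),
                 (19354, "Chris", "ANA")], 1)
load_group(sat, [(19354, "John", "OP1"), (19354, "Mary", "ANA")], 2)
load_group(sat, [(19354, "Jane", "OP1"), (19354, "Chris", "ANA")], 3)
active = {(r.employee_no, r.role) for r in sat if r.active}
print(active)  # Jane is OP1 and Chris is ANA - matching slide 15
```

Running it over the three deliveries of operation 19354 ends with exactly Jane (OP1) and Chris (ANA) active, which the end-dating-only approach of slide 17 could not produce.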
19. Groups of Links: special thanks to …
St. Antonius Hospital (for having the problem)
Edwin Weber (for coding the solution)
Get your copy of the solution:
http://sourceforge.net/projects/pdidatavaultfw/
22. Primary key => distribution key
hub -< satellite join:
- data redistribution
- join local in parallel

Hub:                 Satellite:
BK           SID     SID  LDTS        INFO
Ensemble     1       1    2001-01-01  My first DV
Dimensional  2       1    2014-06-05  DV Masters
                     2    1997-08-02  DM manifesto

(rows spread over Node 1 and Node 2)
23. Hub SID => distribution key
hub -< satellite join:
- join local in parallel

Hub:                 Satellite:
BK           SID     SID  LDTS        INFO
Ensemble     1       1    2001-01-01  First DV
Dimensional  2       1    2014-06-05  DV Masters
                     2    1997-08-02  DM manifesto

(rows spread over Node 1 and Node 2)
24. Link SID => distribution key
Default: L_SID, 1:N & N:M
- data redistribution
- join local in parallel

Link:                  Link satellite:
H_MID  H_SID  L_SID    L_SID  LDTS        LDTS_END    CURRENT
1      A      1        1      2001-01-01  2006-01-01  N
1      B      2        1      2014-06-05  9999-12-31  Y
                       2      2006-01-01  2014-06-05  N

1:N => H_MID on link satellite
- join local in parallel

Link:                  Link satellite:
H_MID  H_SID  L_SID    L_SID  H_MID  H_SID  LDTS        LDTS_END
1      A      1        1      1      A      2001-01-01  2006-01-01
1      B      2        1      1      B      2014-06-05  9999-12-31
                       2      1      A      2006-01-01  2014-06-05

H_MID is the ensemble identifier!

(rows spread over Node 1 and Node 2)
25. Use the ensemble identifier if possible!
Distribute data efficiently to ensure good performance in an MPP database:
- with uneven distribution, one node may become a bottleneck for the whole execution
Try to minimize data movement between nodes:
- data redistribution may occur when joining tables
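To make the co-location argument of slides 22-25 concrete, here is a small Python sketch. It assumes a hypothetical 2-node cluster and uses CRC32 as a stand-in for the database's real distribution hash: when hub and satellite share the same distribution-key value, every join partner lands on the same node and the join runs locally in parallel; with different keys, matching rows may sit on different nodes and must be redistributed first.

```python
import zlib

NODES = 2  # hypothetical 2-node MPP cluster

def node_of(key):
    """Node a row lands on for a given distribution-key value
    (CRC32 stands in for the database's actual distribution hash)."""
    return zlib.crc32(str(key).encode()) % NODES

# The hub and satellite from slides 22/23:
hub = [("Ensemble", 1), ("Dimensional", 2)]           # (BK, SID)
sat = [(1, "2001-01-01", "My first DV"),
       (1, "2014-06-05", "DV Masters"),
       (2, "1997-08-02", "DM manifesto")]             # (SID, LDTS, INFO)

def colocated(hub_key, sat_key):
    """True if every satellite row sits on the same node as its hub row,
    i.e. the hub -< satellite join can run locally on each node."""
    hub_node = {sid: node_of(hub_key(bk, sid)) for bk, sid in hub}
    return all(node_of(sat_key(r)) == hub_node[r[0]] for r in sat)

# Slide 22: hub distributed on its BK, satellite on its SID - different
# key values, so matching rows may land on different nodes (redistribution).
print(colocated(lambda bk, sid: bk, lambda r: r[0]))
# Slide 23: both sides distributed on the hub SID - the join is always local.
print(colocated(lambda bk, sid: sid, lambda r: r[0]))  # True
```

The second case is guaranteed co-located regardless of the hash function, which is exactly why slide 23 recommends the hub SID as the distribution key for the satellite.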
27. SAP #Hana is a column store #database, which brings #efficiency in storage and access - #in-memory.
28. SAP #Hana seems to benefit from its technical #architecture when using 1 broad Satellite per #Hub - #benefit: no need for a #PIT, fewer tables
29. Splitting #Sat's by #rate-of-change is as efficient in storage in a column store; #multiple Sat's are preferable if data is coming from multiple sources (#write efficiency)
30. A #referential join will only perform the join if data from the joined tables is actually used - so create 1 #PIT per #Hub (not as a #SQL view)
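As a sketch of what such a PIT (point-in-time) table does: for each hub key and snapshot date it stores the latest load date of every satellite, so queries can reach all satellites with simple equi-joins. A minimal Python illustration, with two hypothetical satellites (`sat_address`, `sat_status`) that are not from the slides:

```python
from bisect import bisect_right

# Hypothetical satellites of one hub, keyed by hub SID:
# each maps SID -> sorted list of load dates (ISO strings sort correctly).
sat_address = {1: ["2001-01-01", "2014-06-05"]}
sat_status  = {1: ["2003-07-15"]}

def pit_row(sid, snapshot):
    """For one hub key and snapshot date, pick the latest satellite
    load at or before the snapshot (None if no entry exists yet)."""
    def latest(loads):
        i = bisect_right(loads, snapshot)
        return loads[i - 1] if i else None
    return {"sid": sid, "snapshot": snapshot,
            "address_ldts": latest(sat_address.get(sid, [])),
            "status_ldts": latest(sat_status.get(sid, []))}

print(pit_row(1, "2010-12-31"))
# -> {'sid': 1, 'snapshot': '2010-12-31',
#     'address_ldts': '2001-01-01', 'status_ldts': '2003-07-15'}
```

Materializing these rows as a physical table per hub (rather than a SQL view) is what the slide recommends, so Hana's referential join can skip satellites whose columns a query never touches.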
31. #Lesson: DV is an #efficient way of storing data
#Lesson: #SQL views can't be read by Hana Studio
#Lesson: #Hana is still evolving