The document discusses how Data Vault can be implemented efficiently on SAP Hana: thanks to Hana's columnar architecture a single broad satellite per hub works well, splitting satellites by rate-of-change is equally efficient for storage, and multiple satellites are preferable when data comes from multiple sources (write efficiency). It also recommends creating one point-in-time (PIT) table per hub as a physical table rather than a SQL view, to allow efficient referential joins in Hana.
2. Our Dot on the Horizon
- Central point for delivering healthcare process data for medical research
- Integrate various sources
- Historize, trace and pseudonymize all data used
3. Our Journey
- Learning and adapting to Data Vault: not everybody is a modeler (Shu Ha Ri)
- Script, code, build, try, test, throw away and start again
- Testing overrated?
- Architecture improvements:
  - Performance issues with SAS/Microsoft
  - Performance issues with loading scripts
  - Automate the DV load
- From Chaos to SCRUM
4. Our Obstacles
- Registration for the healthcare process vs. usability for research
- Questionnaires: sources or generic models?
- Performance:
  - Do we really need all complete texts?
  - Do we really need 20 years of lab results?
- The usual: conflicting interests, politics, etc.
5. Our preliminary results
- 2013: selection of 5 major studies as starting showcases proved difficult
- 2014: had to choose 5 new showcases from 25 applicants
- Started as a Research Data Platform, now growing towards an Enterprise Data Platform (including Education and BI)
- Architecture now stable
6. Lessons learned
• Automate when possible
• Invest in a team of skilled pioneers
• Models rule everything
• Adopt agility, teach agility
14. Groups of Links: context at hospital
Imagine the following:
• An operation (surgery) is executed by a group of people (first surgeon, second surgeon, assistant, anesthesiologist, etc.)
• An operation is planned a couple of weeks in advance
• Whenever the planning changes in the source, the complete group is sent to the EDW
15. Group of Links: the Data
Time  operation_no  employee_no  role
T=1   19354         John         OP1
      19354         Jane         OP2
      19354         Chris        ANA
T=2   19354         John         OP1
      19354         Mary         ANA
T=3   19354         Jane         OP1
      19354         Chris        ANA

Please note: the actual operation with operation_no 19354 is executed by Jane (OP1) and Chris (ANA).
16. Groups of Links: the Problem
Standard Data Vault loading routines cannot handle this situation:

operation_no  employee_no  role  load_dts
19354         John         OP1   T=1
19354         Jane         OP2   T=1
19354         Chris        ANA   T=1
19354         Mary         ANA   T=2
19354         Jane         OP1   T=3
17. Groups of Links: the Problem
Using end-dating of a link (preferably a validity satellite) cannot handle this problem either:

operation_no  employee_no  role  load_dts  Active?
19354         John         OP1   T=1       No (T=3)
19354         Jane         OP2   T=1       Yes
19354         Chris        ANA   T=1       No (T=3)
19354         Mary         ANA   T=2       Yes
19354         Jane         OP1   T=3       Yes

BK of link used: operation_no + role
18. Groups of Links: our solution
1. Add a validity satellite to the link (for end-dating)
2. Tell the metadata of the automation tool that this is a group validity satellite with BK = operation_no
3. Whenever an existing operation_no is present in the staging layer, set all its current links to Active=No
4. Process as usual
• Remark: because the same row can come back (e.g. John/OP1), it will be set to Active=No and Active=Yes at the same time; therefore there can be no unique index on the BK of the validity satellite, and some cleanup is required after loading
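The steps above can be sketched in code. Below is a minimal Python simulation of the group-validity load, assuming an in-memory list stands in for the validity satellite; the cleanup strategy shown (dropping the spurious deactivate/re-insert pair for rows that come back unchanged, and re-activating the old row) is one possible interpretation of the "cleaning up" remark, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SatRow:
    operation_no: int
    employee_no: str
    role: str
    load_dts: int
    active: bool

def load_group(sat, staging, load_dts):
    """Load one staging batch into the group validity satellite."""
    # Step 3: an operation_no present in staging ends ALL its current links.
    groups = {op for op, _, _ in staging}
    deactivated = set()
    for r in sat:
        if r.active and r.operation_no in groups:
            r.active = False
            deactivated.add((r.operation_no, r.employee_no, r.role))
    # Step 4: process as usual - insert the delivered group as active.
    for op, emp, role in staging:
        sat.append(SatRow(op, emp, role, load_dts, True))
    # Cleanup: the same row can come back (e.g. John/OP1), leaving it both
    # Active=No and Active=Yes at this load; drop the new duplicate and
    # re-activate the most recent old row, so no spurious change is recorded.
    for key in deactivated & {(op, emp, role) for op, emp, role in staging}:
        sat[:] = [r for r in sat if not (
            r.active and r.load_dts == load_dts
            and (r.operation_no, r.employee_no, r.role) == key)]
        old = max((r for r in sat
                   if (r.operation_no, r.employee_no, r.role) == key),
                  key=lambda r: r.load_dts)
        old.active = True

# The data from slide 15:
sat = []
load_group(sat, [(19354, "John", "OP1"), (19354, "Jane", "OP2"),
                 (19354, "Chris", "ANA")], 1)
load_group(sat, [(19354, "John", "OP1"), (19354, "Mary", "ANA")], 2)
load_group(sat, [(19354, "Jane", "OP1"), (19354, "Chris", "ANA")], 3)
active = {(r.employee_no, r.role) for r in sat if r.active}
print(active)  # Jane is OP1 and Chris is ANA - matching slide 15
```

Running it over the three deliveries of operation 19354 ends with exactly Jane (OP1) and Chris (ANA) active, which the end-dating-only approach of slide 17 could not produce.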
19. Groups of Links: special thanks to …
St. Antonius Hospital (for having the problem)
Edwin Weber (for coding the solution)
Get your copy of the solution:
http://sourceforge.net/projects/pdidatavaultfw/
22. Primary key => distribution key
hub -< satellite join:
- data redistribution
- join local in parallel

Hub:                 Satellite:
BK           SID     SID  LDTS        INFO
Ensemble     1       1    2001-01-01  My first DV
Dimensional  2       1    2014-06-05  DV Masters
                     2    1997-08-02  DM manifesto

(rows spread over Node 1 and Node 2)
23. Hub SID => distribution key
hub -< satellite join:
- join local in parallel

Hub:                 Satellite:
BK           SID     SID  LDTS        INFO
Ensemble     1       1    2001-01-01  First DV
Dimensional  2       1    2014-06-05  DV Masters
                     2    1997-08-02  DM manifesto

(rows spread over Node 1 and Node 2)
24. Link SID => distribution key
Default: L_SID, 1:N & N:M
- data redistribution
- join local in parallel

Link:                  Link satellite:
H_MID  H_SID  L_SID    L_SID  LDTS        LDTS_END    CURRENT
1      A      1        1      2001-01-01  2006-01-01  N
1      B      2        1      2014-06-05  9999-12-31  Y
                       2      2006-01-01  2014-06-05  N

1:N => H_MID on link satellite
- join local in parallel

Link:                  Link satellite:
H_MID  H_SID  L_SID    L_SID  H_MID  H_SID  LDTS        LDTS_END
1      A      1        1      1      A      2001-01-01  2006-01-01
1      B      2        1      1      B      2014-06-05  9999-12-31
                       2      1      A      2006-01-01  2014-06-05

H_MID is the ensemble identifier!

(rows spread over Node 1 and Node 2)
25. Use the ensemble identifier if possible!
Distribute data efficiently to ensure good performance in an MPP database:
- with uneven distribution, one node may become a bottleneck for the whole execution
Try to minimize data movement between nodes:
- data redistribution may occur when joining tables
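To make the co-location argument of slides 22-25 concrete, here is a small Python sketch. It assumes a hypothetical 2-node cluster and uses CRC32 as a stand-in for the database's real distribution hash: when hub and satellite share the same distribution-key value, every join partner lands on the same node and the join runs locally in parallel; with different keys, matching rows may sit on different nodes and must be redistributed first.

```python
import zlib

NODES = 2  # hypothetical 2-node MPP cluster

def node_of(key):
    """Node a row lands on for a given distribution-key value
    (CRC32 stands in for the database's actual distribution hash)."""
    return zlib.crc32(str(key).encode()) % NODES

# The hub and satellite from slides 22/23:
hub = [("Ensemble", 1), ("Dimensional", 2)]           # (BK, SID)
sat = [(1, "2001-01-01", "My first DV"),
       (1, "2014-06-05", "DV Masters"),
       (2, "1997-08-02", "DM manifesto")]             # (SID, LDTS, INFO)

def colocated(hub_key, sat_key):
    """True if every satellite row sits on the same node as its hub row,
    i.e. the hub -< satellite join can run locally on each node."""
    hub_node = {sid: node_of(hub_key(bk, sid)) for bk, sid in hub}
    return all(node_of(sat_key(r)) == hub_node[r[0]] for r in sat)

# Slide 22: hub distributed on its BK, satellite on its SID - different
# key values, so matching rows may land on different nodes (redistribution).
print(colocated(lambda bk, sid: bk, lambda r: r[0]))
# Slide 23: both sides distributed on the hub SID - the join is always local.
print(colocated(lambda bk, sid: sid, lambda r: r[0]))  # True
```

The second case is guaranteed co-located regardless of the hash function, which is exactly why slide 23 recommends the hub SID as the distribution key for the satellite.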
27. SAP #Hana is a column store #database, which brings #efficiency in storage and access - #in-memory.
28. SAP #Hana seems to benefit from its technical #architecture when using 1 broad Satellite per #Hub - #benefit: no need for a #PIT, fewer tables
29. Splitting #Sat's by #rate-of-change is as efficient in storage in a column store; #multiple Sat's are preferable if data is coming from multiple sources (#write efficiency)
30. A #referential join will only perform the join if data from the joined tables is actually used - so create 1 #PIT per #Hub (not as a #SQL view)
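As a sketch of what such a PIT (point-in-time) table does: for each hub key and snapshot date it stores the latest load date of every satellite, so queries can reach all satellites with simple equi-joins. A minimal Python illustration, with two hypothetical satellites (`sat_address`, `sat_status`) that are not from the slides:

```python
from bisect import bisect_right

# Hypothetical satellites of one hub, keyed by hub SID:
# each maps SID -> sorted list of load dates (ISO strings sort correctly).
sat_address = {1: ["2001-01-01", "2014-06-05"]}
sat_status  = {1: ["2003-07-15"]}

def pit_row(sid, snapshot):
    """For one hub key and snapshot date, pick the latest satellite
    load at or before the snapshot (None if no entry exists yet)."""
    def latest(loads):
        i = bisect_right(loads, snapshot)
        return loads[i - 1] if i else None
    return {"sid": sid, "snapshot": snapshot,
            "address_ldts": latest(sat_address.get(sid, [])),
            "status_ldts": latest(sat_status.get(sid, []))}

print(pit_row(1, "2010-12-31"))
# -> {'sid': 1, 'snapshot': '2010-12-31',
#     'address_ldts': '2001-01-01', 'status_ldts': '2003-07-15'}
```

Materializing these rows as a physical table per hub (rather than a SQL view) is what the slide recommends, so Hana's referential join can skip satellites whose columns a query never touches.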
31. #Lesson: DV is an #efficient way of storing data
#Lesson: #SQL views can't be read by Hana Studio
#Lesson: #Hana is still evolving