2. Agenda
Best practices in data loading: tools, challenges and methodology
We’ll be taking a look at loading data into PPDM, but much of this applies to generic data loading too
3. Agenda
Best practices in data loading: tools, challenges and methodology
4. We’ve been listening to the Data Manager’s perspective
• PPDM Conference, Houston
• PNEC, Houston
• Data Managers’ challenges:
  • Education
  • Certification
  • Preserving knowledge
  • Process
6. Different data movement scenarios
Data migration • Data integration • Data loading
But all require mapping rules for best practice
7. The business view of data migration can be an issue
• Often started at the end of a programme
• Seen as a business issue (moving the filing cabinet), not technical
• However, the documents in the filing cabinet need to be read, understood and translated to the new system; obsolete files need to be discarded
8. Different data migration methodologies are available
PDM (Practical Data Migration)
• Johny Morris
• Training course
• PDM certification
• Abstract
• V2 due soon
Providers
• Most companies providing data migration services/products have a methodology
• Ours is PDM-like, but more concrete and less abstract
9. Agenda
Best practices in data loading: tools, challenges and methodology
Methodology
10. As an example, our methodology
Project scoping: Landscape analysis • Requirements analysis • Data modelling
Core migration: Data assurance (Data discovery, Data review, Data cleansing) • Migration design and development • Testing design and development
Configuration
Execution
Review
Legacy decommissioning
11. Firstly, review the legacy landscape
The legacy application and its satellites: archive, SAP, reports, Excel/VBA, Access databases
12. Eradicate failure points
Beware the virtual waterfall process: Requirements → Agile development → Signoff → Migrate
13. Agenda
Best practices in data loading: tools, challenges and methodology
Rules
14. Rules are required
• In data migration, integration or loading, one area of commonality is the link between source and target
• This requires design, definition, testing, implementation and documentation
• The aim is automated loading of external data into a common store
• This requires best practice
15. Best practice: A single version of truth
• So for each of these data loaders we want a single version of truth
• Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work
• We want to remove boilerplate components that are only indirectly related to the business rules by which data is loaded
• Let’s look at what goes into a data loader and where the duplication and unnecessary work comes from...
16. The PPDM physical model
• PPDM comes to us as a physical projection, rather than a logical model – it maps directly to a relational database
• Access is therefore via SQL and PL/SQL; low-level detail is important, i.e. how relationships are implemented (e.g. well header to borehole)
• Access considerations: primary keys, foreign keys, data types and conversions, maximum lengths, the load order required by FKs (PPDM “Load of the Rings”), relationship cardinality, etc.
• SQL errors only show up at runtime, so turnaround can be slow
• All of this metadata is available in machine-readable format, so we should use it
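As a sketch of what “use the machine-readable metadata” could mean in practice, the snippet below drives value checks (max length, nullability) from a metadata dictionary rather than hard-coding them. The dictionary stands in for what you might read from the Oracle catalog or PPDM’s own metadata tables; the table and column names are illustrative, not the authoritative PPDM definitions.

```python
# Column metadata as you might extract it from the catalog or from
# PPDM_COLUMN; the entries here are illustrative stand-ins.
COLUMN_META = {
    ("WELL", "WELL_NAME"): {"type": "VARCHAR2", "max_length": 60, "nullable": True},
    ("WELL", "UWI"):       {"type": "VARCHAR2", "max_length": 40, "nullable": False},
}

def check_value(table, column, value):
    """Return a list of problems with `value` for the given column,
    derived entirely from the metadata rather than hand-written rules."""
    meta = COLUMN_META[(table, column)]
    problems = []
    if value is None:
        if not meta["nullable"]:
            problems.append(f"{table}.{column}: NULL not allowed")
        return problems
    if meta["type"] == "VARCHAR2" and len(str(value)) > meta["max_length"]:
        problems.append(f"{table}.{column}: exceeds max length {meta['max_length']}")
    return problems
```

Because the checks read the model description instead of embedding it, a move to a new PPDM version only changes the metadata, not the checking code.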
17. External data sources
• Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, database
• The data model is unlikely to be as rich as PPDM, but there is some definition of the content, e.g. Excel workbooks have a tabular layout with column titles, and worksheets are named
• It can be hard to find people with the relevant skills – you sometimes see ad hoc, non-standard implementations because the developer used whatever skills he/she had: Perl, Python, XSLT, SQL
• So the next clue is that we should use the model information – what elements, attributes and relationships are defined – rather than details of how we access it
• Abstract out the data access layer; don’t mix data access with the business rules required to move the data into PPDM
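A minimal sketch of that separation: each reader yields plain records, so the mapping rules never know whether the source was CSV, XML or Excel. The field names and rules are hypothetical examples, not a real loader.

```python
# Data access layer: each reader yields dict records. Business rules
# operate on records only and never touch files, XPaths or SQL.
import csv
import io
import xml.etree.ElementTree as ET

def read_csv(text):
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))

def read_xml(text, record_tag):
    """Yield one dict per <record_tag> element, keyed by child tag."""
    for el in ET.fromstring(text).iter(record_tag):
        yield {child.tag: child.text for child in el}

def load(records, rules):
    """Apply source-agnostic mapping rules to each record."""
    return [{target: rule(rec) for target, rule in rules.items()}
            for rec in records]
```

The same `rules` dictionary can then be run against any reader, e.g. `load(read_csv(...), rules)` or `load(read_xml(...), rules)`, which is the point: the business rules are written once.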
18. Challenges with domain expert mapping rules
• A common step for defining how a data source is to be loaded is for a domain expert to write it up in Excel
• Not concerned with data access, but some details will creep in, e.g. specifying an XPath
• When lookups, merging/splitting of values, string manipulation or conditional logic appear, the description can become ambiguous
• Also note the duplication: the model metadata is being written into the spreadsheet; if the model changes, the spreadsheet needs to be manually updated
19. Challenges with developer mapping rules
• The example here probably wouldn’t pass a code inspection, but it does illustrate the type of issues that can arise
• Firstly, duplication: this reiterates the Excel rules – they need to match up, but while a domain expert might follow the simple example previously, low-level code can be tricky to discuss
• Secondly, metadata is again duplicated: the names of the tables and columns appear in the SQL statements, and the max length of the name column is checked
• Thirdly, boilerplate code: select/update/insert conditional logic
• Fourthly, data access code appears in the rules
• Finally, the code becomes hard to maintain as the developer moves on to other roles
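One way to avoid the Excel/code duplication described above is a single declarative mapping definition that both executes the load and generates the sign-off documentation. This is only a sketch of the idea; the field names, transforms and descriptions are made up for illustration.

```python
# One definition of the mapping rules. The same structure drives the
# load (apply_mapping) and the documentation (document_mapping), so
# the spreadsheet and the code can no longer drift apart.
MAPPING = [
    # (source field, target column, transform, description for the doc)
    ("well_name", "WELL.WELL_NAME", str.strip, "Trim surrounding whitespace"),
    ("td",        "WELL.FINAL_TD",  float,     "Convert to numeric depth"),
]

def apply_mapping(record):
    """Produce target column values for one source record."""
    return {target: fn(record[src]) for src, target, fn, _ in MAPPING}

def document_mapping():
    """Render the same rules as human-readable documentation lines."""
    return [f"{src} -> {target}: {desc}" for src, target, _, desc in MAPPING]
```

A domain expert can review `document_mapping()` output, while the loader runs `apply_mapping` over every record: one definition, two artifacts.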
20. Documentation of mapping rules
• Word document for sign-off
• Data management record: how data was loaded
• Stored in your MDM data store, so it can be queried
• PPDM mapping tables
21. Test artifacts
• Here is where you do require some duplication
• Tests are stories: they define what the system should do
• If it does, the system is good enough – provided the tests are complete
• If we use a single version of truth to generate tests, the tests will duplicate errors, not find them
22. Agenda
Best practices in data loading: tools, challenges and methodology
Tools
23. Use tools
• Use available metadata
• Abstract out the data access layer
• Use a higher-level DSL for the mapping rules:
  • Increases developer/business team communication
  • Reduces boilerplate code
• One definition:
  • Replaces Excel and code
  • Generates documentation
24. An example of a graphical tool: Altova MapForce
• Tools such as Talend, Mule DataMapper and Altova MapForce take a predominantly graphical approach
• The metadata is loaded on the left and right (source/target), with connecting lines
• In addition to the logic gates for more complex processing, code snippets can be added to implement most business logic
• Issues:
  • Is it really very easy to read? The example here is a simple mapping; imagine PPDM well log curves, reference data tables, etc.
  • It isn’t easy to see what really happens: a+b versus an “adder” – e.g. follow the equal() to Customers – what does that actually do?
• But: documentation and an executable can be generated from that single definitive mapping definition
• Typing errors etc. are mostly eliminated
25. ETL Solutions’ Transformation Manager
• An alternative is to use a textual DSL: again the metadata has been loaded
• No data access code
• Metadata is used extensively: for example warnings, primary keys for identification, relationships
• Typing errors are checked at design time, and model or element changes affecting the code are quickly detected, e.g. moving from PPDM 3.8 to 3.9
• Relationships are used to link transforms: a more logical view with no need to understand the underlying constraints; the complexity of the model doesn’t matter, as the project becomes structured naturally
• FK constraints are used to determine load order
• Metadata is pulled in directly from the source, e.g. PPDM, making use of all the hard work put in by the PPDM Association
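The FK-driven load order mentioned above can be sketched as a topological sort over the constraint metadata. The constraint pairs below are illustrative stand-ins, not the real PPDM constraint set.

```python
# Derive load order from foreign-key metadata with a topological
# sort, rather than hand-maintaining an insert sequence.
from graphlib import TopologicalSorter

# child table -> set of parent tables it references (illustrative)
FKS = {
    "WELL": {"BUSINESS_ASSOCIATE"},
    "WELLBORE": {"WELL"},
    "WELL_LOG": {"WELLBORE"},
}

def load_order(fks):
    """Return tables ordered so that every parent precedes its children."""
    return list(TopologicalSorter(fks).static_order())
```

In a real loader the `FKS` mapping would be read from the database catalog or the PPDM metadata rather than typed in, so adding a table or constraint automatically reorders the load.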
27. Keeping the PPDM data manager happy
One of the many questions a data manager has about the data he/she manages:
Data lineage: how did this data get here?
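As a sketch of how that lineage question might be answered from the PPDM mapping tables, the snippet below mocks a minimal PPDM_MAP_RULE_DETAIL in SQLite. The real PPDM table has a different, richer column layout; the columns and data here are illustrative only.

```python
# Answer "how did this value get here?" by querying a (mocked)
# mapping-rule table. Column names are a simplified stand-in for
# the real PPDM_MAP_RULE_DETAIL layout.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ppdm_map_rule_detail (
    map_id TEXT, source_entity TEXT, source_attribute TEXT,
    target_entity TEXT, target_attribute TEXT, rule_desc TEXT)""")
conn.execute("INSERT INTO ppdm_map_rule_detail VALUES "
             "('M1', 'LAS_HEADER', 'WELL', 'WELL', 'WELL_NAME', "
             "'Trimmed and upper-cased')")

def lineage(target_entity, target_attribute):
    """List the sources and rules that feed a given target column."""
    cur = conn.execute(
        """SELECT source_entity, source_attribute, rule_desc
           FROM ppdm_map_rule_detail
           WHERE target_entity = ? AND target_attribute = ?""",
        (target_entity, target_attribute))
    return cur.fetchall()
```

Once the mapping rules live in queryable tables like this, the data manager can answer lineage questions with ordinary SQL instead of reading loader code.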
30. Agenda
Best practices in data loading: tools, challenges and methodology
Project management
31. Key points
• Be aware
  • Look at data migration methodologies
  • Select appropriate components
  • Look for and remove large risky steps
• Start early
  • Ensure correct resources will be available
  • No nasty budget surprises
• Use tools
• Build a happy virtual team
32. Questions
• Did you know about these tables?
• Who uses them?
• How do you use them?
• What features would be truly useful in a data loader tool?
33. Contact us for more information:
Karl Glenn, Business Development Director
kg@etlsolutions.com
+44 (0) 1912 894040
Read more on our website:
http://www.etlsolutions.com/what-we-do/oil-and-gas/
Raising data management standards
www.etlsolutions.com
Images from Free Digital Photos freedigitalphotos.net
Speaker notes
To keep things simple when I’m talking, we’ll discuss loading data into PPDM, but a lot of this applies to generic data loading – moving data out of PPDM, or not involving PPDM at all. Data transformation is mundane from a business perspective, but very important to get right. The less time and trouble it causes, the more time you can spend doing more interesting things that directly benefit your business. Badly loaded data by definition affects the quality of the data in your MDM store.
How I came to be here giving this talk: data migration expert with ETL Solutions for 10+ years. My first PPDM conference was last October in Houston. We’re from a data loading and integration background, and it was a bit of an eye-opener to listen to the data manager’s view of things and of PPDM; it seemed to me that companies such as ourselves are part of the problem. One aspect of the conference that caught my attention was the education, certification and, for want of a better word, professionalisation of petroleum data management – moving away from compartmentalised expertise within companies. I had the impression that one of the industry’s concerns is that a lot of this expertise is held in people’s heads, with the experienced people looking forward more to retirement than to their careers. At PNEC, it was clear that some companies are very well advanced in this respect; for standards bodies like PPDM there is a lot of work to be done. There are similar rumblings within the data migration and integration industry, so today’s talk will be an introduction to data migration best practices, then a quick look at how these link up with data management best practice. Went to PNEC/PPDM, got the data management perspective: lots of interest in certification, education, passing on the legacy. Also: the focus on understanding the data – lineage, quality – led me to think about how we as data loaders don’t really think about ongoing data management. [cause of/solution to] Hence this talk on data migration good practices and process.
We’ll first look at 3 scenarios for moving data about
And yet, because it happens at the end of the programme, it is often started late and treated as a technical issue – moving the filing cabinet.
-----
DM happens at the end of the programme, just before the old system turns off, so it runs late and hence is usually started late into the programme. Possibly an exaggeration, but the business view is a forklift truck picking up the filing cabinet and dropping it into the new system. Maybe a new filing cabinet, making sure all the papers are in there before moving. In reality, you need to read, understand and translate the documents to the new system, and discard obsolete papers. Only the business knows what’s important.
DM methodologies, certifications and training are sprouting up the same way as in data management. What’s available? PDM – v2 coming; it’s a bit abstract. Project Management for Data Conversions and Data Management, Charles Scott.
To keep things simple when I’m talking, we’ll discuss loading data into PPDM, but a lot of this applies to generic data loading – moving data out of PPDM, or not involving PPDM at all.Data transformation is mudane from a business perspective, but very important to get right. The less time and trouble it causes, the more time you can spend doing more interesting things directly benefiting your business.Badly loaded data by definition affects the quality of the data in your MDM store.
And workflow parts.
------------------
A key to success is the old system being turned off. This gets people interested. Concentrate on legacy decommissioning and see what light it sheds on the process – it turns out to be a lot. Sometimes a process can seem like a sea of required steps, and it is hard to get people to buy into it. Focus on something everyone can agree with – the legacy system will be turned off. Surprisingly, this is often not considered in much detail, but it gives you leverage and grabs people’s attention if you can get them to believe you.
WHAT: you need to know what you’re turning off. Programme: single version of truth. Data quality. Model the systems and relationships you have – as you go out and start talking to the people who use these systems, and how, you’ll discover more connections and systems, including little empires: satellite systems used for operations that you want brought into the new system. You’ll also discover bits of the legacy system nobody uses – no need to migrate those. Landscape analysis: typically data discovery is done here, not data profiling.
Data migration is unusual in many ways. There are several teams of people involved: users, new system providers, old system experts, migration experts, project management, “the business”. You write code that correctly interprets the relevant information in the old system, and moves and checks it into the new system – it’s complicated. At the end of the programme, you run it once; it’s tested to ensure the new system will allow business to continue confidently – you don’t want them, or you, sleepless with worry. Then you throw it away. There’s a lot of nostalgia around at the moment – you can buy your childhood memories on eBay – but if you’re looking for a 70s-style waterfall methodology, you’d be hard pushed to find services or product companies that don’t have the word agile in their process. Don’t despair: with data migration you can create your own waterfall method despite each stage being demonstrably agile. Describe the diagram. Requirements – light, just move the data. Then some truly agile development, maybe asking some requirements questions, unit testing and so on – let’s pretend that all the different bits are developed together and there are no nasty surprises when you try to join them up in your week’s worth of integration testing. Testers – they know how to write tests, right? – so alongside the agile development they’re writing tests, starting with the big pebbles and filling up the jar; let’s assume they test incrementally. Then signoff – a big step: security, data stakeholders, huge documents. Then migrate – the users have hardly noticed you up till now; told to test, they think of lots of fiendish little corner cases, e.g. postcodes. Signoff here might fail – a big jump back at the end of the project. Users might reject the migration – a bigger jump.
In the three scenarios we looked at, one area of commonality is the innocuous arrows linking the source and target. A great deal of work is involved in realising them – design, requirements, testing, implementation, documentation. What we really want is the implementation: the automated loading of external data into PPDM. Let’s look at best practice for implementing these.
So for each of these data loaders, to relate back to a mantra we’ve heard at previous data management conferences, ideally we want a single version of truth. Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work. We want to remove boilerplate components that are only indirectly related to the business rules by which data is loaded. Let’s look at what goes into a data loader and where the duplication and unnecessary work comes from.
PPDM comes to us as a physical projection, rather than a logical model – it maps directly to a relational database. Access is therefore via SQL and PL/SQL; low-level detail is important, i.e. how relationships are implemented, e.g. well header to borehole. Access considerations: primary keys, foreign keys, data types and conversions, maximum lengths, the load order required by FKs – PPDM “Load of the Rings” – relationship cardinality, etc. SQL errors are only known at runtime, so turnaround can be slow. All of this metadata is available in machine-readable format, so we should use it.
Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, database, etc. The data model is unlikely to be as rich as PPDM, but there is some definition of the content: the LAS 2.0 specification; Excel workbooks have a layout, e.g. tabular with column titles, and worksheets are named, etc. It can be hard to find people with the relevant skills, and you can end up with ad hoc, non-standardised implementations because the developer used whatever skills he had: Perl, Python, XSLT, SQL. So the next clue is that we should use the model information – what elements, attributes and relationships are defined – rather than details of how we access it. Abstract out the data access layer; don’t mix data access with the business rules required to move the data into PPDM.
A common step for defining how a data source is to be loaded is for a domain expert to write it up in Excel. It is not concerned with data access, but some details will creep in, e.g. specifying an XPath. When lookups, merging/splitting of values, string manipulation, conditional logic etc. come in, the description can become ambiguous. Also note the duplication: the model metadata is being written into the spreadsheet; if the model changes, the spreadsheet needs to be manually updated.
A developer implements those rules in code. The pseudo-code above shows typical things that are undesirable. First, duplication: this reiterates the Excel rules – they need to match up, but while a domain expert might follow the simple example above, low-level code can be tricky to discuss. Second, metadata is again duplicated: the names of the tables and columns appear in the SQL statements, and the max length of the name column is checked. There is an explicit looping construct. Third, boilerplate code: select/update/insert conditional logic. Fourth, data access code appears in the rules. I’ve made it explicit here, and the code above probably wouldn’t pass a code inspection, but it does illustrate the type of duplication that can arise. In particular, the developer reads the specifications; the knowledge is stored in the developer’s head and regurgitated as code. The developer becomes valuable, and the code becomes hard to maintain as the talented developer moves on.
Tools are a recognised best practice; they’re better than trying to do things by hand, e.g. hand-coding migration scripts, workflow, profiling. But you do need skills in the toolsets.
Graphical tools such as Talend, Mule DataMapper and, here, Altova MapForce take a predominantly graphical approach. You can see the metadata loaded in on the left and right (source/target), with lines connecting them. In addition to the logic gates for more complex processing, in the background you can add code snippets to implement most business logic. Issues: is it really very easy to read? (The above is a simple mapping; imagine PPDM well log curves, reference data tables, etc.) It isn’t easy to see what really happens: a+b versus an “adder” – e.g. follow the equal() to Customers – what does that actually do? But: documentation and an executable can be generated from that single definitive mapping definition, and typing errors etc. are mostly eliminated.
An alternative is to use a textual DSL. Again you can see the metadata has been loaded. [Switch to TM live to enable interaction.] No data access code. Metadata is used extensively – for example warnings, primary keys for identification; relationships – cardinality, no explicit iteration. Typing errors are checked at design time. Model and element changes that affect the code are quickly detected – imagine, e.g., moving from PPDM 3.8 to 3.9. Relationships are used to link transforms – a more logical view, with no need to understand the underlying constraints; the complexity of the model doesn’t matter, and the project becomes structured naturally. FK constraints are used to determine load order. Metadata is pulled in directly from the source metadata, e.g. PPDM. Show comments; customisation. So, making use of all the hard work put in by PPDM.
From the same source you can generate code to execute the project
As a data manager, looking at your single version of truth (logical – it may physically be many databases), you want to be able to ask questions about it. Confidence in the data. Legal questions – should we be looking at archived data, where is it, and how did that data map to our current data view? Is there a link between instances of common errors – e.g. a problem with data loaded from a particular source? Looking back at data load, migration and integration, they have something in common: the arrows, or the rules by which data is loaded. On the diagram they look like nothing, but a lot of work goes into them. Extending the single-version-of-truth analogy, you want a single version of truth for each of these arrows. Look into them and they are generally very ad hoc and poorly documented. Business rules: Excel – not well version-controlled, woolly language, vague. Then implemented in code or using a graphical tool. Duplication. Difficult to communicate between developer and domain expert. From a DM perspective, where are these rules? Well, PPDM have done a great job in this respect by providing tables specifically for that.
PPDM provides tables to allow us to record this: PPDM_SCHEMA, PPDM_TABLE and PPDM_COLUMN to describe the schemas, and PPDM_MAP_RULE and PPDM_MAP_RULE_DETAIL to record the mappings. PPDM_SCHEMA doesn’t just enable you to store details about your PPDM schema – it’s pretty much the same as the Oracle catalog tables, with a few additions, for example for recording units of measure. So you can record schema information there about the legacy system, or about XML schemas such as WITSML or PRODML. The PPDM_MAP tables let you record the mapping rules: how a particular element or attribute goes from the source to the target schema and is stored in PPDM. This makes you, the data manager, happy, because you can query the database using your finely honed SQL skills to create reports for the business users who need this information – hopefully not the legal department. PPDM_MAP_RULE is used to contain lower-level code, e.g. PL/SQL, Python rules, etc. It’s hard to populate these tables, though, and their use is not standardised.
So: how about a tool, which can already generate documentation, generating the same for the PPDM metadata module? The above is a bit simplistic – hopefully PPDM_SYSTEM, PPDM_TABLE etc. are already populated for the actual PPDM instance. And we only want to publish mappings when they are actually used. Switch to a demonstration to show the TM prototype of how they can be populated.
You can store code, but is it readable? The unit of code is usually a block, perhaps saying how to migrate an entire table.
So how about the tools used in data loading populating these tables automatically? Most, though not all, have some metadata representation. With hand-coding, the metadata is often encapsulated in the code itself – the developer read the documentation and created the queries and updates to run against the data stores – but otherwise you expect a tool to show you what you are moving data between. You’d also hope for a higher-level representation of the mapping rules – lines, boxes, a domain-specific language – and possibly a reporting capability. So what we did was take our reporting capabilities and look at how we can “report” to PPDM – export the metadata and mapping rules into the relevant module. I want to emphasise that this was an investigation: we don’t have production code and we may have got some of the details wrong. But I’m going to fire up our tool, show you some simple mapping rules as developed, then show you what we populated in the PPDM schema and mapping tables at the push of a button.
A typical tool – using ours because I get a discount from our sales guy. [Better picture required.]
At the end of these initial phases, you might realise that the big bang approach is not going to work, and you need to change your approach significantly. It’s much better to realise this early on. E.g. you might migrate bit by bit, maintaining both systems and using a data highway to move data between the old and new systems. [DVLA]
You need people to agree to turning the old system off; identify these people and bring them with you. Decide what documents they will sign off – if you require design documents, these will need to be understood; are user stories better than low-level business rules? When? So, two weeks before you run the final project you have all the documents ready for sign-off – what do you do, hand them all over then for the stakeholders to sign off? These documents will be large and hard to understand, and you
WHEN: Need to ensure business continuity