2. Agenda
Best practices in data loading: tools, challenges and methodology
We’ll be taking a look at loading data into PPDM, but much of this applies to generic data loading too
3. Agenda
Best practices in data loading: tools, challenges and methodology
4. We’ve been listening to the Data Manager’s perspective
• PPDM Conference, Houston
• PNEC, Houston
• Data Managers’ challenges:
  • Education
  • Certification
  • Preserving knowledge
  • Process
6. Different data movement scenarios
Data migration • Data integration • Data loading
But all require mapping rules for best practice
7. The business view of data migration can be an issue
• Often started at the end of a programme
• Seen as a business issue (moving the filing cabinet), not technical
• However, the documents in the filing cabinet need to be read, understood and translated to the new system; obsolete files need to be discarded
8. Different data migration methodologies are available
PDM (Practical Data Migration)
• Johny Morris
• Training course
• PDM certification
• Abstract
• V2 due soon
Providers
• Most companies providing data migration services/products have a methodology
• Ours is PDM-like, but more concrete and less abstract
9. Agenda
Best practices in data loading: tools, challenges and methodology
Methodology
10. As an example, our methodology
Project scoping: Landscape analysis • Requirements analysis • Data modelling
Core migration: Data assurance (Data discovery, Data review, Data cleansing) • Migration design and development • Testing design and development
Configuration
Execution
Review
Legacy decommissioning
11. Firstly, review the legacy landscape
The legacy application and its satellites: archive, SAP, reports, Excel/VBA, Access databases
12. Eradicate failure points
Beware the virtual waterfall process: Requirements → Agile development → Signoff → Migrate
13. Agenda
Best practices in data loading: tools, challenges and methodology
Rules
14. Rules are required
• In data migration, integration or loading, one area of commonality is the link between source and target
• This requires design, definition, testing, implementation and documentation
• The aim is automated loading of external data into a common store
• This requires best practice
15. Best practice: A single version of truth
• So for each of these data loaders we want a single version of truth
• Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work
• We want to remove boilerplate components that are only indirectly related to the business rules by which data is loaded
• Let’s look at what goes into a data loader and where the duplication and unnecessary work comes from...
16. The PPDM physical model
• PPDM comes to us as a physical projection, rather than a logical model – it maps directly to a relational database
• Access is therefore via SQL and PL/SQL; low-level detail is important, i.e. how relationships are implemented (e.g. well header to borehole)
• Access considerations: primary keys, foreign keys, data types and conversions, maximum lengths, the load order required by FKs (PPDM “Load of the Rings”), relationship cardinality, etc.
• SQL errors only show up at runtime, so turnaround can be slow
• All of this metadata is available in machine-readable format, so we should use it
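As a sketch of what “use the machine-readable metadata” could mean in practice, the snippet below drives value checks (max length, nullability) from a metadata dictionary rather than hard-coding them. The dictionary stands in for what you might read from the Oracle catalog or PPDM’s own metadata tables; the table and column names are illustrative, not the authoritative PPDM definitions.

```python
# Column metadata as you might extract it from the catalog or from
# PPDM_COLUMN; the entries here are illustrative stand-ins.
COLUMN_META = {
    ("WELL", "WELL_NAME"): {"type": "VARCHAR2", "max_length": 60, "nullable": True},
    ("WELL", "UWI"):       {"type": "VARCHAR2", "max_length": 40, "nullable": False},
}

def check_value(table, column, value):
    """Return a list of problems with `value` for the given column,
    derived entirely from the metadata rather than hand-written rules."""
    meta = COLUMN_META[(table, column)]
    problems = []
    if value is None:
        if not meta["nullable"]:
            problems.append(f"{table}.{column}: NULL not allowed")
        return problems
    if meta["type"] == "VARCHAR2" and len(str(value)) > meta["max_length"]:
        problems.append(f"{table}.{column}: exceeds max length {meta['max_length']}")
    return problems
```

Because the checks read the model description instead of embedding it, a move to a new PPDM version only changes the metadata, not the checking code.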
17. External data sources
• Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, database
• The data model is unlikely to be as rich as PPDM, but there is some definition of the content, e.g. Excel workbooks have a tabular layout with column titles, and worksheets are named
• It can be hard to find people with the relevant skills – you sometimes see ad hoc, non-standard implementations because the developer used whatever skills he/she had: Perl, Python, XSLT, SQL
• So the next clue is that we should use the model information – what elements, attributes and relationships are defined – rather than details of how we access it
• Abstract out the data access layer; don’t mix data access with the business rules required to move the data into PPDM
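A minimal sketch of that separation: each reader yields plain records, so the mapping rules never know whether the source was CSV, XML or Excel. The field names and rules are hypothetical examples, not a real loader.

```python
# Data access layer: each reader yields dict records. Business rules
# operate on records only and never touch files, XPaths or SQL.
import csv
import io
import xml.etree.ElementTree as ET

def read_csv(text):
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))

def read_xml(text, record_tag):
    """Yield one dict per <record_tag> element, keyed by child tag."""
    for el in ET.fromstring(text).iter(record_tag):
        yield {child.tag: child.text for child in el}

def load(records, rules):
    """Apply source-agnostic mapping rules to each record."""
    return [{target: rule(rec) for target, rule in rules.items()}
            for rec in records]
```

The same `rules` dictionary can then be run against any reader, e.g. `load(read_csv(...), rules)` or `load(read_xml(...), rules)`, which is the point: the business rules are written once.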
18. Challenges with domain expert mapping rules
• A common step for defining how a data source is to be loaded is for a domain expert to write it up in Excel
• Not concerned with data access, but some details will creep in, e.g. specifying an XPath
• When lookups, merging/splitting of values, string manipulation or conditional logic appear, the description can become ambiguous
• Also note the duplication: the model metadata is being written into the spreadsheet; if the model changes, the spreadsheet needs to be manually updated
19. Challenges with developer mapping rules
• The example here probably wouldn’t pass a code inspection, but it does illustrate the type of issues that can arise
• Firstly, duplication: this reiterates the Excel rules – they need to match up, but while a domain expert might follow the simple example previously, low-level code can be tricky to discuss
• Secondly, metadata is again duplicated: the names of the tables and columns appear in the SQL statements, and the max length of the name column is checked
• Thirdly, boilerplate code: select/update/insert conditional logic
• Fourthly, data access code appears in the rules
• Finally, the code becomes hard to maintain as the developer moves on to other roles
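One way to avoid the Excel/code duplication described above is a single declarative mapping definition that both executes the load and generates the sign-off documentation. This is only a sketch of the idea; the field names, transforms and descriptions are made up for illustration.

```python
# One definition of the mapping rules. The same structure drives the
# load (apply_mapping) and the documentation (document_mapping), so
# the spreadsheet and the code can no longer drift apart.
MAPPING = [
    # (source field, target column, transform, description for the doc)
    ("well_name", "WELL.WELL_NAME", str.strip, "Trim surrounding whitespace"),
    ("td",        "WELL.FINAL_TD",  float,     "Convert to numeric depth"),
]

def apply_mapping(record):
    """Produce target column values for one source record."""
    return {target: fn(record[src]) for src, target, fn, _ in MAPPING}

def document_mapping():
    """Render the same rules as human-readable documentation lines."""
    return [f"{src} -> {target}: {desc}" for src, target, _, desc in MAPPING]
```

A domain expert can review `document_mapping()` output, while the loader runs `apply_mapping` over every record: one definition, two artifacts.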
20. Documentation of mapping rules
• Word document for sign-off
• Data management record: how data was loaded
• Stored in your MDM data store, so it can be queried
• PPDM mapping tables
21. Test artifacts
• Here is where you do require some duplication
• Tests are stories: they define what the system should do
• If it does, the system is good enough – provided the tests are complete
• If we use a single version of truth to generate tests, the tests will duplicate errors, not find them
22. Agenda
Best practices in data loading: tools, challenges and methodology
Tools
23. Use tools
• Use available metadata
• Abstract out the data access layer
• Use a higher-level DSL for the mapping rules:
  • Increases developer/business team communication
  • Reduces boilerplate code
• One definition:
  • Replaces Excel and code
  • Generates documentation
24. An example of a graphical tool: Altova MapForce
• Tools such as Talend, Mule DataMapper and Altova MapForce take a predominantly graphical approach
• The metadata is loaded on the left and right (source/target), with connecting lines
• In addition to the logic gates for more complex processing, code snippets can be added to implement most business logic
• Issues:
  • Is it really very easy to read? The example here is a simple mapping; imagine PPDM well log curves, reference data tables, etc.
  • It isn’t easy to see what really happens: a+b versus an “adder” – e.g. follow the equal() to Customers – what does that actually do?
• But: documentation and an executable can be generated from that single definitive mapping definition
• Typing errors etc. are mostly eliminated
25. ETL Solutions’ Transformation Manager
• An alternative is to use a textual DSL: again the metadata has been loaded
• No data access code
• Metadata is used extensively: for example warnings, primary keys for identification, relationships
• Typing errors are checked at design time, and model or element changes affecting the code are quickly detected, e.g. moving from PPDM 3.8 to 3.9
• Relationships are used to link transforms: a more logical view with no need to understand the underlying constraints; the complexity of the model doesn’t matter, as the project becomes structured naturally
• FK constraints are used to determine load order
• Metadata is pulled in directly from the source, e.g. PPDM, making use of all the hard work put in by the PPDM Association
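The FK-driven load order mentioned above can be sketched as a topological sort over the constraint metadata. The constraint pairs below are illustrative stand-ins, not the real PPDM constraint set.

```python
# Derive load order from foreign-key metadata with a topological
# sort, rather than hand-maintaining an insert sequence.
from graphlib import TopologicalSorter

# child table -> set of parent tables it references (illustrative)
FKS = {
    "WELL": {"BUSINESS_ASSOCIATE"},
    "WELLBORE": {"WELL"},
    "WELL_LOG": {"WELLBORE"},
}

def load_order(fks):
    """Return tables ordered so that every parent precedes its children."""
    return list(TopologicalSorter(fks).static_order())
```

In a real loader the `FKS` mapping would be read from the database catalog or the PPDM metadata rather than typed in, so adding a table or constraint automatically reorders the load.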
27. Keeping the PPDM data manager happy
One of the many questions a data manager has about the data he/she manages:
Data lineage: how did this data get here?
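As a sketch of how that lineage question might be answered from the PPDM mapping tables, the snippet below mocks a minimal PPDM_MAP_RULE_DETAIL in SQLite. The real PPDM table has a different, richer column layout; the columns and data here are illustrative only.

```python
# Answer "how did this value get here?" by querying a (mocked)
# mapping-rule table. Column names are a simplified stand-in for
# the real PPDM_MAP_RULE_DETAIL layout.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ppdm_map_rule_detail (
    map_id TEXT, source_entity TEXT, source_attribute TEXT,
    target_entity TEXT, target_attribute TEXT, rule_desc TEXT)""")
conn.execute("INSERT INTO ppdm_map_rule_detail VALUES "
             "('M1', 'LAS_HEADER', 'WELL', 'WELL', 'WELL_NAME', "
             "'Trimmed and upper-cased')")

def lineage(target_entity, target_attribute):
    """List the sources and rules that feed a given target column."""
    cur = conn.execute(
        """SELECT source_entity, source_attribute, rule_desc
           FROM ppdm_map_rule_detail
           WHERE target_entity = ? AND target_attribute = ?""",
        (target_entity, target_attribute))
    return cur.fetchall()
```

Once the mapping rules live in queryable tables like this, the data manager can answer lineage questions with ordinary SQL instead of reading loader code.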
30. Agenda
Best practices in data loading: tools, challenges and methodology
Project management
31. Key points
• Be aware
  • Look at data migration methodologies
  • Select appropriate components
  • Look for and remove large risky steps
• Start early
  • Ensure correct resources will be available
  • No nasty budget surprises
• Use tools
• Build a happy virtual team
32. Questions
• Did you know about these tables?
• Who uses them?
• How do you use them?
• What features would be truly useful in a data loader tool?
33. Contact us for more information:
Karl Glenn, Business Development Director
kg@etlsolutions.com
+44 (0) 1912 894040
Read more on our website:
http://www.etlsolutions.com/what-we-do/oil-and-gas/
Raising data management standards
www.etlsolutions.com
Images from Free Digital Photos freedigitalphotos.net
Speaker notes
To keep things simple when I’m talking, we’ll discuss loading data into PPDM, but a lot of this applies to generic data loading – moving data out of PPDM, or not involving PPDM at all. Data transformation is mundane from a business perspective, but very important to get right. The less time and trouble it causes, the more time you can spend doing more interesting things that directly benefit your business. Badly loaded data by definition affects the quality of the data in your MDM store.
How I came to be here giving this talk: data migration expert with ETL Solutions for 10+ years. My first PPDM conference was last October in Houston. We’re from a data loading and integration background, and it was a bit of an eye-opener to listen to the data manager’s view of things and of PPDM; it seemed to me that companies such as ourselves are part of the problem. One aspect of the conference that caught my attention was the education, certification and, for want of a better word, professionalisation of petroleum data management – moving away from compartmentalised expertise within companies. I had the impression that one of the industry’s concerns is that a lot of this expertise is held in people’s heads, with the experienced people looking forward more to retirement than to their careers. At PNEC, it was clear that some companies are very well advanced in this respect; for standards bodies like PPDM there is a lot of work to be done. There are similar rumblings within the data migration and integration industry, so today’s talk will be an introduction to data migration best practices, then a quick look at how these link up with data management best practice. Went to PNEC/PPDM, got the data management perspective: lots of interest in certification, education, passing on the legacy. Also: the focus on understanding the data – lineage, quality – led me to think about how we as data loaders don’t really think about ongoing data management. [cause of/solution to] Hence this talk on data migration good practices and process.
We’ll first look at 3 scenarios for moving data about
And yet, because it happens at the end of the programme, it is often started late and treated as a technical issue – moving the filing cabinet.
-----
DM happens at the end of the programme, just before the old system turns off, so it runs late and hence is usually started late into the programme. Possibly an exaggeration, but the business view is a forklift truck picking up the filing cabinet and dropping it into the new system. Maybe a new filing cabinet, making sure all the papers are in there before moving. In reality, you need to read, understand and translate the documents to the new system, and discard obsolete papers. Only the business knows what’s important.
DM methodologies, certifications and training are sprouting up the same way as in data management. What’s available? PDM – v2 coming; it’s a bit abstract. Project Management for Data Conversions and Data Management, Charles Scott.
To keep things simple when I’m talking, we’ll discuss loading data into PPDM, but a lot of this applies to generic data loading – moving data out of PPDM, or not involving PPDM at all.Data transformation is mudane from a business perspective, but very important to get right. The less time and trouble it causes, the more time you can spend doing more interesting things directly benefiting your business.Badly loaded data by definition affects the quality of the data in your MDM store.
And workflow parts.
------------------
A key to success is the old system being turned off. This gets people interested. Concentrate on legacy decommissioning and see what light it sheds on the process – it turns out to be a lot. Sometimes a process can seem like a sea of required steps, and it is hard to get people to buy into it. Focus on something everyone can agree with – the legacy system will be turned off. Surprisingly, this is often not considered in much detail, but it gives you leverage and grabs people’s attention if you can get them to believe you.
WHAT: you need to know what you’re turning off. Programme: single version of truth. Data quality. Model the systems and relationships you have – as you go out and start talking to the people who use these systems, and how, you’ll discover more connections and systems, including little empires: satellite systems used for operations that you want brought into the new system. You’ll also discover bits of the legacy system nobody uses – no need to migrate those. Landscape analysis: typically data discovery is done here, not data profiling.
Data migration is unusual in many ways. There are several teams of people involved: users, new system providers, old system experts, migration experts, project management, “the business”. You write code that correctly interprets the relevant information in the old system, and moves and checks it into the new system – it’s complicated. At the end of the programme, you run it once; it’s tested to ensure the new system will allow business to continue confidently – you don’t want them, or you, sleepless with worry. Then you throw it away. There’s a lot of nostalgia around at the moment – you can buy your childhood memories on eBay – but if you’re looking for a 70s-style waterfall methodology, you’d be hard pushed to find services or product companies that don’t have the word agile in their process. Don’t despair: with data migration you can create your own waterfall method despite each stage being demonstrably agile. Describe the diagram. Requirements – light, just move the data. Then some truly agile development, maybe asking some requirements questions, unit testing and so on – let’s pretend that all the different bits are developed together and there are no nasty surprises when you try to join them up in your week’s worth of integration testing. Testers – they know how to write tests, right? – so alongside the agile development they’re writing tests, starting with the big pebbles and filling up the jar; let’s assume they test incrementally. Then signoff – a big step: security, data stakeholders, huge documents. Then migrate – the users have hardly noticed you up till now; told to test, they think of lots of fiendish little corner cases, e.g. postcodes. Signoff here might fail – a big jump back at the end of the project. Users might reject the migration – a bigger jump.
In the three scenarios we looked at, one area of commonality is the innocuous arrows linking the source and target. A great deal of work is involved in realising them – design, requirements, testing, implementation, documentation. What we really want is the implementation: the automated loading of external data into PPDM. Let’s look at best practice for implementing these.
So for each of these data loaders, to relate back to a mantra we’ve heard at previous data management conferences, ideally we want a single version of truth. Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work. We want to remove boilerplate components that are only indirectly related to the business rules by which data is loaded. Let’s look at what goes into a data loader and where the duplication and unnecessary work comes from.
PPDM comes to us as a physical projection, rather than a logical model – it maps directly to a relational database. Access is therefore via SQL and PL/SQL; low-level detail is important, i.e. how relationships are implemented, e.g. well header to borehole. Access considerations: primary keys, foreign keys, data types and conversions, maximum lengths, the load order required by FKs – PPDM “Load of the Rings” – relationship cardinality, etc. SQL errors are only known at runtime, so turnaround can be slow. All of this metadata is available in machine-readable format, so we should use it.
Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, database, etc. The data model is unlikely to be as rich as PPDM, but there is some definition of the content: the LAS 2.0 specification; Excel workbooks have a layout, e.g. tabular with column titles, and worksheets are named, etc. It can be hard to find people with the relevant skills, and you can end up with ad hoc, non-standardised implementations because the developer used whatever skills he had: Perl, Python, XSLT, SQL. So the next clue is that we should use the model information – what elements, attributes and relationships are defined – rather than details of how we access it. Abstract out the data access layer; don’t mix data access with the business rules required to move the data into PPDM.
A common step for defining how a data source is to be loaded is for a domain expert to write it up in Excel. It is not concerned with data access, but some details will creep in, e.g. specifying an XPath. When lookups, merging/splitting of values, string manipulation, conditional logic etc. come in, the description can become ambiguous. Also note the duplication: the model metadata is being written into the spreadsheet; if the model changes, the spreadsheet needs to be manually updated.
A developer implements those rules in code. The pseudo-code above shows typical things that are undesirable. First, duplication: this reiterates the Excel rules – they need to match up, but while a domain expert might follow the simple example above, low-level code can be tricky to discuss. Second, metadata is again duplicated: the names of the tables and columns appear in the SQL statements, and the max length of the name column is checked. There is an explicit looping construct. Third, boilerplate code: select/update/insert conditional logic. Fourth, data access code appears in the rules. I’ve made it explicit here, and the code above probably wouldn’t pass a code inspection, but it does illustrate the type of duplication that can arise. In particular, the developer reads the specifications; the knowledge is stored in the developer’s head and regurgitated as code. The developer becomes valuable, and the code becomes hard to maintain as the talented developer moves on.
Tools are a recognised best practice; they’re better than trying to do things by hand, e.g. hand-coding migration scripts, workflow, profiling. But you do need skills in the toolsets.
Graphical tools such as Talend, Mule DataMapper and, here, Altova MapForce take a predominantly graphical approach. You can see the metadata loaded in on the left and right (source/target), with lines connecting them. In addition to the logic gates for more complex processing, in the background you can add code snippets to implement most business logic. Issues: is it really very easy to read? (The above is a simple mapping; imagine PPDM well log curves, reference data tables, etc.) It isn’t easy to see what really happens: a+b versus an “adder” – e.g. follow the equal() to Customers – what does that actually do? But: documentation and an executable can be generated from that single definitive mapping definition, and typing errors etc. are mostly eliminated.
An alternative is to use a textual DSL. Again you can see the metadata has been loaded. [Switch to TM live to enable interaction.] No data access code. Metadata is used extensively – for example warnings, primary keys for identification; relationships – cardinality, no explicit iteration. Typing errors are checked at design time. Model and element changes that affect the code are quickly detected – imagine, e.g., moving from PPDM 3.8 to 3.9. Relationships are used to link transforms – a more logical view, with no need to understand the underlying constraints; the complexity of the model doesn’t matter, and the project becomes structured naturally. FK constraints are used to determine load order. Metadata is pulled in directly from the source metadata, e.g. PPDM. Show comments; customisation. So, making use of all the hard work put in by PPDM.
From the same source you can generate code to execute the project
As a data manager, looking at your single version of truth (logical – it may physically be many databases), you want to be able to ask questions about it. Confidence in the data. Legal questions – should we be looking at archived data, where is it, and how did that data map to our current data view? Is there a link between instances of common errors – e.g. a problem with data loaded from a particular source? Looking back at data load, migration and integration, they have something in common: the arrows, or the rules by which data is loaded. On the diagram they look like nothing, but a lot of work goes into them. Extending the single-version-of-truth analogy, you want a single version of truth for each of these arrows. Look into them and they are generally very ad hoc and poorly documented. Business rules: Excel – not well version-controlled, woolly language, vague. Then implemented in code or using a graphical tool. Duplication. Difficult to communicate between developer and domain expert. From a DM perspective, where are these rules? Well, PPDM have done a great job in this respect by providing tables specifically for that.
PPDM provides tables to allow us to record this: PPDM_SCHEMA, PPDM_TABLE and PPDM_COLUMN to describe the schemas, and PPDM_MAP_RULE and PPDM_MAP_RULE_DETAIL to record the mappings. PPDM_SCHEMA doesn’t just enable you to store details about your PPDM schema – it’s pretty much the same as the Oracle catalog tables, with a few additions, for example for recording units of measure. So you can record schema information there about the legacy system, or about XML schemas such as WITSML or PRODML. The PPDM_MAP tables let you record the mapping rules: how a particular element or attribute goes from the source to the target schema and is stored in PPDM. This makes you, the data manager, happy, because you can query the database using your finely honed SQL skills to create reports for the business users who need this information – hopefully not the legal department. PPDM_MAP_RULE is used to contain lower-level code, e.g. PL/SQL, Python rules, etc. It’s hard to populate these tables, though, and their use is not standardised.
So: how about a tool, which can already generate documentation, generating the same for the PPDM metadata module? The above is a bit simplistic – hopefully PPDM_SYSTEM, PPDM_TABLE etc. are already populated for the actual PPDM instance. And we only want to publish mappings when they are actually used. Switch to a demonstration to show the TM prototype of how they can be populated.
You can store code, but is it readable? The unit of code is usually a block, perhaps saying how to migrate an entire table.
So how about the tools used in data loading populating these tables automatically? Most, though not all, have some metadata representation. With hand-coding, the metadata is often encapsulated in the code itself – the developer read the documentation and created the queries and updates to run against the data stores – but otherwise you expect a tool to show you what you are moving data between. You’d also hope for a higher-level representation of the mapping rules – lines, boxes, a domain-specific language – and possibly a reporting capability. So what we did was take our reporting capabilities and look at how we can “report” to PPDM – export the metadata and mapping rules into the relevant module. I want to emphasise that this was an investigation: we don’t have production code and we may have got some of the details wrong. But I’m going to fire up our tool, show you some simple mapping rules as developed, then show you what we populated in the PPDM schema and mapping tables at the push of a button.
A typical tool – using ours because I get a discount from our sales guy. [Better picture required.]
At the end of these initial phases, you might realise that the big bang approach is not going to work, and you need to change your approach significantly. It’s much better to realise this early on. E.g. you might migrate bit by bit, maintaining both systems and using a data highway to move data between the old and new systems. [DVLA]
You need people to agree to turning the old system off; identify these people and bring them with you. Decide what documents they will sign off – if you require design documents, these will need to be understood; are user stories better than low-level business rules? When? So, two weeks before you run the final project you have all the documents ready for sign-off – what do you do, hand them all over then for the stakeholders to sign off? These documents will be large and hard to understand, and you
WHEN: Need to ensure business continuity