Doug Needham's Data Vault presentation discussing the mathematical interpretation of links and its application to business modeling, along with volumetrics and some general principles for Data Vault best practices.
2. Who are we?
CLEAR MEASURES offers a range of services and solutions designed to
satisfy needs shared by firms large and small, along with the skills required to make
your customized goals a reality. If your goals aren’t yet defined, CLEAR
MEASURES can help you define a strategy for managing, analyzing, or
visualizing your data in ways that make your path easier to identify.
• Analytics and Intelligence
• Data Integration
• Enterprise Architecture
• Strategic & Project Management
• Cloud Infrastructure
• Database Administration
• System Administration
• Technology Services
3. Who are we?
All our customers have access to:
Capacity
Pay on demand, in 15-minute increments, not the half-day or full-day blocks you
pay for with a contractor.
Coverage
True 24 X 7 Coverage, with in-facility staff directed from our Global Operations
Center in Covington, Kentucky.
Cost
CLEAR MEASURES can help your team with cost-effective services from Rural Sourcing
and Global Sourcing locations. CLEAR MEASURES’ proprietary ONguard system
allows for complete direction of a global workforce with U.S. oversight, focused on
efficiency and repeatability.
4. Who am I?
• The Data Guy
• 1st job was Marine Corps DBA supporting the Entire Marine
Corps at the main site for Systems Software Evaluation.
• First 10 years of my career as a DBA.
• 20 years of data management.
• Most recent decade building analytical systems.
• Pentaho, Informatica, Business Objects, Cognos, Oracle, SQL
Server, MySQL.
• Cloud-based analytics with a large healthcare information
company on Cassandra.
• Trying to figure out where Data Science and Big Data fit
together with the Data Warehouse.
5. This is the wrong time for
Data Science
• It is also the wrong time for a Data Warehouse, Business
Intelligence platform, Data Vault, Data Mining, Big Data, or any
other predictive, machine-learning, or analytics platform.
• Do these projects when things are going well. Anticipate what
could happen to prevent things from going poorly.
6. When is the right time?
• If you have multiple systems you need to integrate.
• As you lay the foundation for Self Service Business
Intelligence.
• To lay the foundation of a Data-as-a-Service application.
• If you are combining data from many applications,
systems, or business units, or providing data to many
applications, systems, or business units that each want
data delivered in slightly different standard feeds.
7. Data Science and The Data
Warehouse
• “Data Science is the application of statistical and
mathematical rigor to business data.” Doug
• I have heard it said 80% of data science is data munging.
• Data Vault is: “100% of the data 100% of the time” – Dan
L.
• What does this mean?
• What does the data say? Where did the data come from?
What happened to the data from the time it was captured
until the time it was presented?
• Models, Statistical Models specifically, are the core of
Data Science.
• Looking forward to hearing more about DV 2.0 and how it
supports Polyglot persistence.
8. Data Science and The Data
Warehouse
• By the way, we have been doing this for a while.
• Some data is predictive; all data is instructive.
• Being able to create a statistical model, quickly run lots
of data through that statistical model, observe the actual
results and compare these with predicted results allows
us to refine the statistical model.
• Are Business Analysts Data Scientists? What is the main
difference between the two?
• Which one “needs” more data? Which one can actually
use more data?
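The refinement loop described above (fit a statistical model, run new data through it, compare predicted with actual results) can be sketched in a few lines. The numbers and the simple least-squares line here are illustrative assumptions, not data from the talk.

```python
import numpy as np

# Illustrative historical data (hypothetical numbers).
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
revenue = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit a simple statistical model: a least-squares line.
slope, intercept = np.polyfit(spend, revenue, deg=1)

# Quickly run new data through the model, then compare
# predicted results with the actual observed results.
new_spend = np.array([6.0, 7.0])
predicted = slope * new_spend + intercept
actual = np.array([12.2, 13.9])
residuals = actual - predicted

# Large residuals signal that the model needs refinement.
print(predicted, residuals)
```

The point is not the particular model but the cycle: observe, predict, measure the gap, and refine.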
9. Quick Trivia
• Who was one of the first Data Scientists?
• Now let’s talk about storing all of this data we collect, and
see if there is anything new with our understanding of the
structures we are all familiar with.
10. Data Vault
• The integration layer of an overall data warehouse strategy.
• There are other areas of data warehousing.
• Presentation
• Near-Line
• Archive
• Applications within the enterprise are the data capture
mechanisms.
• I think everyone is trying to find the best way to leverage a “Big
Data” platform into the world of the Data Warehouse.
• Data vault is the mechanism that allows a data warehouse to
evolve over time.
• Simple, straightforward, repeatable, auditable, resilient.
11. Modeling
• HUBs – Business Keys
• LNKs – Relationships
• SATs – Contextual data.
• There are other entities of the Data Vault
methods, however, these are the primary entities.
Everything else is functionally dependent on some
combination of the above.
• Notice the colors: Hubs are one color, Links another, Sats a
third. Anything else should be a separate color.
12. HUB
• Business Keys.
• Isolated entities that can stand alone representing a list
of unique business keys.
• The collection of business keys for an organization is the
answer to the question, “What do we do?”
• Which business key is most important?
• How many edges does it have?
13. LNK
• Relationships.
• Isolated entities that can stand alone representing a list
of unique relationships between business keys.
• The collection of relationships for an organization is the
answer to the question, “At what time does whom do
what to whom or what?”
• Links are actually very interesting in their own right. We
will be speaking further about links specifically a little
later in this session.
14. LNK
• How many edges does a link have? The number of
incoming edges a Link table has is the number of
HUB_SQNs the link is connecting (This includes weak
hubs).
• Outgoing Edges are the number of Satellites connected
to this Link table.
• What is the ratio of OE/IE?
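As a sketch, the edge counts above can be captured per link and the OE/IE ratio computed directly. The link names and counts below are hypothetical, not from any real vault.

```python
# Edge profile per Link table: incoming edges are the hub sequence keys
# (HUB_SQNs) the link connects, including weak hubs; outgoing edges are
# the satellites attached to the link. Names and counts are hypothetical.
links = {
    "LNK_ORDER":    {"incoming": 3, "outgoing": 2},
    "LNK_SHIPMENT": {"incoming": 2, "outgoing": 1},
}

def oe_ie_ratio(link):
    """Ratio of outgoing (satellite) edges to incoming (hub) edges."""
    return link["outgoing"] / link["incoming"]

for name, link in links.items():
    print(name, round(oe_ie_ratio(link), 2))
```

Computing this ratio across every link in the model gives each link a small numeric signature that can then be compared.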
17. Now What?
• Now that I have these numbers, what do I do with them?
• This is one way to confirm the accuracy of the
sequencing of your business keys in a link, in order to
separate the driver business key from the dependent
keys.
• Are there any other links in the Data Vault that have a
similar Cosine?
18. Now What?
• If you have cosine similarity between links does this
mean something?
• What is going on in the business? Is it obvious the links
are related?
• More importantly, is it not obvious why two links are
similar within a margin of error?
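One way to make "cosine similarity between links" concrete is to represent each link as an incidence vector over the vault's hubs (1 if the link connects that hub, 0 otherwise) and compare the vectors. The hubs and links below are hypothetical examples.

```python
import math

# Hypothetical hub list and two hypothetical links, each represented as
# an incidence vector over those hubs.
hubs = ["HUB_CUSTOMER", "HUB_PRODUCT", "HUB_STORE", "HUB_CARRIER"]
lnk_order    = [1, 1, 1, 0]   # connects customer, product, store
lnk_shipment = [1, 1, 0, 1]   # connects customer, product, carrier

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Links sharing most of their hubs score close to 1.0.
print(round(cosine(lnk_order, lnk_shipment), 3))
```

Two links with high similarity that are *not* obviously related in the business process are exactly the cases worth investigating.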
19. SAT
• Contextual data.
• Detail data. Most pertinent for use in loading downstream
systems.
• The “Payload” of the satellite is the data you want to
capture.
• The collection of satellites for an organization provides
the context describing its business keys and relationships.
• Has one edge.
20. Satellite Clustering
• Using some simple k-means clustering with Euclidean
distance calculations you can identify divergent rates of
change within a satellite.
• This is one way to divaricate satellites coming from a
single source table.
• If you are interested in knowing more about this, let me
know.
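A minimal sketch of the idea, assuming per-column change rates have already been measured from the source table (the column names and rates are hypothetical): a plain one-dimensional k-means with Euclidean distance separates slowly changing from rapidly changing attributes, suggesting where to split the satellite.

```python
import random

# Hypothetical change rates (updates per day) for each column of one wide
# source table currently loaded into a single satellite.
rates = {
    "address": 0.01, "phone": 0.02, "email": 0.03,     # slowly changing
    "status": 4.2, "balance": 5.0, "last_login": 7.5,  # rapidly changing
}

def kmeans_1d(values, k=2, iters=20, seed=0):
    """Plain k-means on 1-D points; Euclidean distance is just abs()."""
    random.seed(seed)
    centers = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d(list(rates.values()))
# Columns landing in different clusters are candidates for separate satellites.
print(sorted(round(c, 2) for c in centers))
```

Columns whose rates cluster together share a cadence of change, which is the natural boundary for splitting one source table into multiple satellites.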
21. Philosophies
• From Dan: “100% of the data 100% of the time”
• From Doug: “A model is not valid until 100% of the
model is populated from source systems.”
• Notice I did not say 100% of the data as Dan did.
• During development, the assumptions built into the
model have to be validated.
• Designing a proper data vault model does not take very
long for those versed in its abilities. Loading the model to
validate the assumptions built into the model is
paramount to success.
22. Philosophies
• The second portion of this philosophy is to extract data
from the Vault to an alternative system, be that a star
schema, statistical research, data science, Excel, etc.
Something downstream needs to be populated FROM
the vault.
• In order to know you have a valid model, data must both
go in and come out accurately according to business
rules.
• This must be done in order to say a particular phase of
the development cycle is complete.
• What does complete mean? It means this is the end of
the beginning. Welcome to the world of Data Warehouse
support, maintenance and evolution.
23. Aesthetics
• One of the most fascinating things about a data vault
model - to me - is that it flows quite aesthetically in
accordance with the particular business processes the
data vault is attempting to model.
• It just makes sense to a variety of users, from technical
to executive.
• The following slide is an example of this, where we are
modeling a process and something surprising came out
of the modeling exercise.
24. What do I mean by
Aesthetics?
• Can you do this with another data modeling technique?
25. Architecture
• A data architect understands applications are only the
entry point of data into the Enterprise. Data Science
makes data forever useful.
27. Summary
• One of the main reasons Architects are constantly studying
designs is that they are continuously looking for ways not just
to create something new, but to reduce new problems to
ones already solved. The same can be said of
Mathematicians, Engineers, Physicists, even managers
and executives.
• The Data Vault is a repeatable pattern for database design
when that database is to be used for integration of multiple
systems. There are many other uses for Data Vault, of
course, but this is the first principle of why the data vault
exists.
• As we learn from prior implementations, be they our
own, or from someone else, let us continuously strive to
not only reduce problems to those already solved but look
for, and discuss these repeatable patterns of Data Vault
design.
28. Final thoughts
• With the Data Vault, the structure itself has meaning.
• This is a feature that I believe is unique to Data Vault
modeling.
• Our email contact information:
• dneedham@clearmeasures.com
• pdokouzov@clearmeasures.com
Editor's Notes
Kepler was the first Data Scientist because Brahe had collected and stored many years of observations (data), yet he had no way of interpreting them accurately until Kepler studied the data and came up with his laws of planetary motion. Kepler came up with an accurate model that not only explained the observations of Brahe, but also predicted future observations.