Automating Data Warehouse
Patterns Through Metadata
Davide Mauri
dmauri@solidq.com
Davide Mauri
20 Years of experience on the SQL Server Platform
– Specialized in Data Solution Architecture, Database Design,
Performance Tuning, Business Intelligence, Data Warehouse, Big Data
& Analytics

Microsoft SQL Server MVP
President of UGISS (Italian SQL Server UG)
Mentor @ SolidQ
– Regular Speaker @ SQL Server events
– Projects, Consulting, Mentoring & Training

Find me here:
– Blog: http://sqlblog.com/blogs/davide_mauri/default.aspx
– Twitter: @mauridb
Building a DWH in 2013
Is still an (almost) manual process
A *lot* of repetitive low-value work
No (or very few) standard tools available
How it should be
Semi-automatic process
– “develop by intent”

CREATE DIMENSION Customer
FROM SourceCustomerTable
MAP USING CustomerMetadata

ALTER DIMENSION Customer
ADD ATTRIBUTE LoyaltyLevel
AS TYPE 1

CREATE FACT Orders
FROM SourceOrdersTable
MAP USING OrdersMetadata

ALTER FACT Orders
ADD DIMENSION Customer

Define the mapping logic from a semantic perspective
– Source to Dimensions / Measures
• (Metadata anyone?)

Design the model and let the tool build it for you
The perfect BI process & architecture

Iterative!
Is automation possible?

DWH PROCESSES
Invest in Automation?
Faster development
– Reduce Costs
– Embrace Changes

Fewer bugs
Increase solution quality and make it consistent
throughout the whole product
Automation Pre-Requisites
Split the process into two separate types of
processes
– What can be automated
– What can NOT be automated

Create and impose a set of rules that define
– How to solve common technical problems
– How to implement such identified solutions
No Monkey Work!
Let the people think
and let the machines
do the «monkey» work.
Design Pattern
“A general reusable
solution to a commonly
occurring problem
within a given context”
Design Pattern
Generic ETL Pattern
– Partition Load
– Incremental/Differential Load

Generic BI Design Pattern
– Slowly Changing Dimension
• SCD1, SCD2, etc.

– Fact Table
• Transactional, Snapshot, Temporal Snapshot
Design Pattern
Specific SQL Server Patterns
– Change Data Capture
– Change Tracking
– Partition Load
– SSIS Parallelism
Engineering the DWH
“Software Engineering
allows and requires the
formalization of the
software building and
maintenance processes.”
Sample Rules
• Always add a «last_update» column
• Always log Inserted/Updated/Deleted rows to
log.load_info table
• Use MD5 – binary(16) for checksums
• Use views to expose data
– Dimension & Fact views MUST use the same column
names for lookup columns
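
A minimal sketch of the checksum rule, in Python rather than T-SQL. The column names, the separator character and the choice to treat NULL as an empty string are assumptions made for the illustration, not part of the original rules.

```python
import hashlib

def row_checksum(row, columns):
    """Return a 16-byte MD5 digest over the chosen columns of a row."""
    # Assumption: NULL (None) is treated as an empty string.
    parts = ["" if row[c] is None else str(row[c]) for c in columns]
    # Join with a separator so ("ab", "c") and ("a", "bc") hash differently.
    payload = "\x1f".join(parts).encode("utf-8")
    return hashlib.md5(payload).digest()  # 16 bytes, fits a binary(16) column

# Hypothetical customer row
customer = {"customer_id": 42, "name": "Contoso", "city": "Stockholm"}
checksum = row_checksum(customer, ["name", "city"])
```

Storing the raw 16-byte digest instead of its 32-character hex form is what makes binary(16) a sufficient column type.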
Engineering the DWH
There are two intrinsic
processes hidden in the
development of a BI
solution that must be
allowed (or forced) to
emerge.
Business Process
Data manipulation,
transformation, enrichment
& cleansing logic

Specific for every customer.
Hardly automatable
Technical Process
Application of data
extraction and loading
techniques
Recurring (pattern) in
any solution

Highly Automatable
Hi-Level Vision
[Diagram: OLTP → STG → DWH, connected by ETL steps; the diagram labels two parts as «Technical Process» and one as «Business Process».]
ETL Phases
«E» and «L» must be
– Simple, Easy and Straightforward
– Completely Automated
– Completely Reusable

«E» and «L» have ZERO value in a BI Solution
– Should be done in the most economical way
Well known solution to common problems

PATTERN
Source Full Load

E
Source Incremental Load
In this scenario,
“ID” is an IDENTITY/SEQUENCE.
Probably a PK.

E
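
One way to picture the incremental pattern is a watermark on the ever-increasing “ID”. The sketch below is Python over in-memory rows; the function name and the sample data are hypothetical, and a real extraction would run the same filter as a query against the source.

```python
def extract_incremental(source_rows, last_loaded_id):
    """Return rows whose ID is above the stored watermark, plus the new watermark."""
    new_rows = [r for r in source_rows if r["id"] > last_loaded_id]
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = max((r["id"] for r in new_rows), default=last_loaded_id)
    return new_rows, new_watermark

# Hypothetical source table
source = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 30}]
rows, watermark = extract_incremental(source, last_loaded_id=1)
```

The watermark returned by one run is the input of the next, so it has to be persisted together with the loaded rows.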
Source Differential Load/1

In this scenario the source table
doesn’t offer any specific way to
understand what has changed

E
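
When the source gives no hint about changes, one common option is to compare a checksum per row against the checksums kept from the previous load. This is a sketch under that assumption; the key column `id`, the helper names and the use of MD5 over all non-key columns are illustrative choices.

```python
import hashlib

def checksum(row, key):
    """MD5 over all non-key columns, in a stable column order."""
    payload = "\x1f".join(str(row[c]) for c in sorted(row) if c != key)
    return hashlib.md5(payload.encode("utf-8")).digest()

def diff_load(source_rows, staged_checksums, key="id"):
    """Split source rows into inserted and updated, and list deleted keys.

    staged_checksums maps key -> checksum stored by the previous load.
    """
    inserted, updated, seen = [], [], set()
    for row in source_rows:
        k = row[key]
        seen.add(k)
        if k not in staged_checksums:
            inserted.append(row)
        elif staged_checksums[k] != checksum(row, key):
            updated.append(row)
    deleted = [k for k in staged_checksums if k not in seen]
    return inserted, updated, deleted

# Hypothetical previous load and current source
previous = [{"id": 1, "city": "Milan"}, {"id": 2, "city": "Rome"}]
staged = {r["id"]: checksum(r, "id") for r in previous}
current = [{"id": 1, "city": "Stockholm"}, {"id": 3, "city": "Oslo"}]
inserted, updated, deleted = diff_load(current, staged)
```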
Source Differential Load/2

In this scenario the source table
has a TimeStamp-Like column

E
Source Differential Load

E

• SQL Server 2012 features that can help with
incremental/differential load
– Change Data Capture
• Natively supported in SSIS 2012
• http://www.mattmasson.com/2011/12/cdc-in-ssis-for-sqlserver-2012-2/

– Change Tracking
• Underused feature in BI… not as rich as CDC but MUCH
simpler and easier
SCD 1 & SCD 2
L

[Flowchart]
1. Start
2. Calculate MD5 Checksum of Non-SCD-Key Columns
3. Lookup Dimension Id and MD5 Checksum From Business Key
4. Dimension Id is Null?
– Yes: Insert new members into DWH
– No: go to step 5
5. Checksums are different?
– Yes: Store into temp table
6. Merge data from temp table to DWH
7. End
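
The lookup-and-compare logic of this slide can be sketched in a few lines of Python. The dictionary shapes and the function name are assumptions for the illustration; checksums are taken as already computed.

```python
def classify_members(incoming, dimension):
    """Split incoming rows into new members and changed members.

    incoming:  list of dicts with 'business_key' and 'checksum'
    dimension: dict business_key -> (dimension_id, stored_checksum)
    """
    new_members, changed = [], []
    for row in incoming:
        dim_id, stored = dimension.get(row["business_key"], (None, None))
        if dim_id is None:
            # Dimension Id is null: the member does not exist yet.
            new_members.append(row)
        elif stored != row["checksum"]:
            # Checksums differ: keep the row for the later merge.
            changed.append({**row, "dimension_id": dim_id})
    return new_members, changed

# Hypothetical data
dimension = {"C1": (1, b"aaa"), "C2": (2, b"bbb")}
incoming = [
    {"business_key": "C1", "checksum": b"aaa"},  # unchanged
    {"business_key": "C2", "checksum": b"zzz"},  # changed
    {"business_key": "C3", "checksum": b"ccc"},  # new
]
new_members, changed = classify_members(incoming, dimension)
```

Unchanged rows fall through both branches and are simply ignored, which is what keeps the merge step small.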
SCD 2 Special Note

L

• Merge => UPDATE Interval + INSERT New Row
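
As a rough illustration of “UPDATE Interval + INSERT New Row”, the sketch below closes the current version of a member and appends a new one. The `valid_from` / `valid_to` column names and the use of `None` for an open interval are assumptions, not taken from the slides.

```python
from datetime import date

def apply_scd2_change(history, business_key, new_attributes, change_date):
    """Close the current row for business_key and append a new current row."""
    for row in history:
        if row["business_key"] == business_key and row["valid_to"] is None:
            row["valid_to"] = change_date  # UPDATE: close the open interval
    history.append({                       # INSERT: the new current version
        "business_key": business_key,
        **new_attributes,
        "valid_from": change_date,
        "valid_to": None,
    })
    return history

# Hypothetical dimension history
history = [{"business_key": "C1", "city": "Milan",
            "valid_from": date(2012, 1, 1), "valid_to": None}]
apply_scd2_change(history, "C1", {"city": "Stockholm"}, date(2013, 11, 1))
```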
FACT TABLE LOAD

L
Partition Load

EL
Parallel Load
• Logically split the work in several steps
– E.g.: Load/Process one customer at a time

• Create a «queue» table that stores information for each step
– Step 1 -> Load Customer «A»
– Step 2 -> Load Customer «B»

• Create a Package that
1. Picks the first step not already picked up
2. Does the work
3. Goes back to step 1

• Call the Package «n» times simultaneously

EL
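
A small Python sketch of the queue idea, with threads standing in for the «n» simultaneous package executions. The `StepQueue` class is an in-memory stand-in for the «queue» table and all names are hypothetical; with a real table, the pick would have to be an atomic update so two workers never take the same step.

```python
import threading

class StepQueue:
    """In-memory stand-in for the «queue» table."""
    def __init__(self, steps):
        self._pending = list(steps)
        self._lock = threading.Lock()

    def pick_next(self):
        # Atomically pick the first step not already picked up.
        with self._lock:
            return self._pending.pop(0) if self._pending else None

def run_package(queue, done, done_lock):
    """One 'package' instance: pick a step, do the work, repeat."""
    while True:
        step = queue.pick_next()
        if step is None:
            return  # nothing left to pick
        # "Do work": a real package would load the customer here.
        with done_lock:
            done.append(step)

def run_parallel(steps, n):
    """Call the package n times simultaneously and collect the finished steps."""
    queue, done, done_lock = StepQueue(steps), [], threading.Lock()
    workers = [threading.Thread(target=run_package, args=(queue, done, done_lock))
               for _ in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return done

processed = run_parallel(["Load Customer A", "Load Customer B", "Load Customer C"], n=2)
```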
Other SSIS Specific Patterns
• Range Lookup
– Not natively supported
– Matt Masson has the answer in his blog
• http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx
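
For readers unfamiliar with the term, a range lookup maps a value to the interval that contains it instead of matching an exact key. The Python sketch below only illustrates the concept with a binary search; it is not the SSIS technique described in the linked post, and the ranges and labels are invented.

```python
import bisect

# Hypothetical ranges: (inclusive lower bound, label), sorted by lower bound.
RANGES = [(0, "Bronze"), (100, "Silver"), (500, "Gold")]
LOWER_BOUNDS = [low for low, _ in RANGES]

def range_lookup(value):
    """Return the label of the range containing value, or None if below all ranges."""
    i = bisect.bisect_right(LOWER_BOUNDS, value) - 1
    return RANGES[i][1] if i >= 0 else None
```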
A key ingredient in automation

METADATA
Metadata
Provide context information
– Which columns are used to build/feed a
Dimension?
– Which columns are Business Keys?
– Which table is the Fact Table?
– How are Fact and Dimension connected?
• Which columns are used?
How to manage Metadata?
• Naming Convention

• Extended Properties
• Specific, Ad Hoc Database or Tables
• Other (XML, File, etc.)
Naming Convention
• The easiest and cheapest
– No additional (hidden) costs
– No need to be maintained
– Never out-of-sync
– No documentation needed
• Actually, it IS PART of the documentation
– Imposes a Standard

• Very limited in terms of flexibility and usage
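
To make the idea concrete, here is a sketch of reading metadata straight from names. The convention itself (`dim_` / `fact_` table prefixes, `bk_` for business keys, `id_` for lookup columns) is invented for the example; any consistent convention would work the same way.

```python
def describe_table(table_name, columns):
    """Derive metadata from names alone, using a hypothetical naming convention."""
    if table_name.startswith("dim_"):
        kind = "dimension"
    elif table_name.startswith("fact_"):
        kind = "fact"
    else:
        kind = "unknown"
    return {
        "kind": kind,
        "business_keys": [c for c in columns if c.startswith("bk_")],
        "lookup_columns": [c for c in columns if c.startswith("id_")],
    }

# Hypothetical tables
customer_meta = describe_table("dim_customer", ["id_customer", "bk_customer_code", "name"])
orders_meta = describe_table("fact_orders", ["id_customer", "id_date", "amount"])
```

The limits mentioned on the slide show up quickly: a prefix can say what a column is, but not much more.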
Extended Properties
Supports most metadata needs
No additional software needed

Very verbose usage
– Development of a wrapper to make usage simpler is
feasible and encouraged
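
A wrapper can be as small as a function that builds the verbose call for you. The sketch below generates a T-SQL call to the `sys.sp_addextendedproperty` system procedure for a column; the Python helper names and the `dwh_role` property name are hypothetical, and the generated statement is only built as text here, not executed.

```python
def quote(value):
    """Render a value as a T-SQL Unicode string literal, doubling single quotes."""
    return "N'" + str(value).replace("'", "''") + "'"

def tag_column(schema, table, column, name, value):
    """Build the sp_addextendedproperty call that tags one column."""
    return (
        "EXEC sys.sp_addextendedproperty "
        f"@name = {quote(name)}, @value = {quote(value)}, "
        f"@level0type = N'SCHEMA', @level0name = {quote(schema)}, "
        f"@level1type = N'TABLE', @level1name = {quote(table)}, "
        f"@level2type = N'COLUMN', @level2name = {quote(column)};"
    )

# Hypothetical usage: mark a column as a business key
statement = tag_column("dwh", "dim_customer", "customer_code", "dwh_role", "business_key")
```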
Metadata Objects
Dedicated Ad-Hoc Database and Tables

As Flexible as you need
Maintenance overhead to keep metadata in sync with
data
– An automatic check procedure must be developed
– DMVs can help a lot here
External Metadata Objects
Really expensive to keep them in-sync
– A tool is needed, otherwise too much manual
work

Does not give any specific benefits with respect
to Ad-Hoc Database/Tables
DEMO
Let’s make it possible!

AUTOMATION
Automation Scenarios
• Run-Time: «Auto-Configuring» Packages
– Really hard to customize packages
– SSIS limitations must be managed
• E.g.: Data Flow cannot be changed at runtime
• On-the-fly creation of packages may be needed

• Design-Time: Package Generators / Package Templates
– Easy to customize created packages
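
The design-time approach boils down to turning metadata into an artifact that can still be edited afterwards. As a simplified stand-in for an SSIS package generator, the Python sketch below produces a plain SQL statement from a metadata dictionary; the metadata shape and the table and column names are assumptions for the illustration.

```python
def generate_staging_load(meta):
    """Generate an INSERT ... SELECT statement from a metadata description."""
    cols = ", ".join(meta["columns"])
    return (
        f"INSERT INTO {meta['target']} ({cols})\n"
        f"SELECT {cols}\n"
        f"FROM {meta['source']};"
    )

# Hypothetical metadata for one staging load
meta = {
    "source": "src.Customer",
    "target": "stg.Customer",
    "columns": ["CustomerCode", "Name", "City"],
}
statement = generate_staging_load(meta)
```

Because the output is ordinary text, it can be reviewed, versioned and hand-tuned like any other script, which is the advantage the slide attributes to design-time generation.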
Automation Solutions
• Specific Tool/frameworks
– BIML / MIST

• SQL Server Platform
– SQL, PowerShell, .NET
– SMO, AMO
Package Generators
Required Assemblies
Microsoft.SqlServer.ManagedDTS
Microsoft.SqlServer.DTSRuntimeWrap
Microsoft.SqlServer.DTSPipelineWrap

Path:
C:\Program Files (x86)\Microsoft SQL Server\110\SDK\Assemblies
DEMO
Useful Resources
• «STOCK» Tasks:
– http://msdn.microsoft.com/en-us/library/ms135956.aspx

• How to set Task properties at runtime:
– http://technet.microsoft.com/en-us/library/microsoft.sqlserver.dts.runtime.executables.add.aspx
BIML – BI Markup Language
• Developed by Varigence
– http://www.varigence.com
– http://bimlscript.com/
– MIST: BIML Full-Featured IDE

• Free via BIDS Helper
– Support “limited” to SSIS package generation
– http://bidshelper.codeplex.com
THANK YOU!
• For attending this session and
PASS SQLRally Nordic 2013, Stockholm

Editor's notes

  1. http://chartporn.org/2012/05/10/repetitive-tasks/
  2. http://en.wikipedia.org/wiki/Software_design_pattern
  3. http://en.wikipedia.org/wiki/Software_design_pattern
  4. http://en.wikipedia.org/wiki/Software_design_pattern
  5. Matt Masson Blog: http://blogs.msdn.com/b/mattm/archive/2008/11/25/lookup-pattern-range-lookups.aspx