Build data quality rules and data cleansing into your data pipelines
1. Build data quality rules and data cleansing into your data pipelines
Principal Product Manager
Microsoft Data Integration
2. Data Quality for Data Warehouse Scenarios
• Verify data types and lengths
• How should I handle NULLs?
• Domain value constraints (e.g., US states)
• Single source of truth (master data)
• Late arriving dimensions
• Reference data lookups
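The slide lists these warehouse checks conceptually; as a rough illustration outside of ADF, a minimal pandas sketch of a few of them (NULL handling on the business key, length verification for ZIP codes, and a domain-value constraint for states) could look like the following. The column names, the abbreviated US_STATES set, and the quarantine split are hypothetical choices for the example, not the deck's implementation.

```python
import pandas as pd

# Hypothetical incoming batch for a customer dimension.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, None],
    "state":       ["WA", "wa", "ZZ", "OR"],
    "zip_code":    ["98052", "98101", "9810", "97035"],
})

US_STATES = {"WA", "OR", "CA"}  # abbreviated reference set for the example

# How should I handle NULLs?  Here: reject rows missing the business key.
df = df[df["customer_id"].notna()].copy()

# Verify data types and lengths: ZIP codes must be 5-digit strings.
df["zip_ok"] = df["zip_code"].str.fullmatch(r"\d{5}")

# Domain value constraint (e.g., US states): normalize case, then check membership.
df["state"] = df["state"].str.upper()
df["state_ok"] = df["state"].isin(US_STATES)

# Route failures to a quarantine output instead of the warehouse load.
passed = df["zip_ok"] & df["state_ok"]
quarantine = df[~passed]
clean = df[passed].drop(columns=["zip_ok", "state_ok"])
print(clean, quarantine, sep="\n")
```

In an ADF data flow the same rules would typically be expressed with derived-column and conditional-split transformations rather than Python.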
3. Data Quality for Data Science Scenarios
• De-duplicate data
• Descriptive data statistics (length, type, mean, median, stddev, …)
• Frequency distribution
• How do I handle missing values?
• Enumerations (turn codes into categorical data)
• Value replacement
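As a hedged, generic illustration of the data science checks above (descriptive statistics, frequency distribution, missing-value handling, and value replacement), here is a small pandas sketch; the feature table and the imputation choices are invented for the example.

```python
import pandas as pd

# Hypothetical feature table for a model.
df = pd.DataFrame({
    "age":    [34, 51, None, 29, 45],
    "income": [72000, None, 58000, 61000, 90000],
    "status": ["A", "B", "A", None, "A"],
})

# Descriptive data statistics: length, type, mean, median, stddev, ...
print(df.dtypes)
print(df.describe())

# Frequency distribution of a categorical column.
print(df["status"].value_counts(dropna=False))

# How do I handle missing values?  One option: impute numerics with the median.
for col in ["age", "income"]:
    df[col] = df[col].fillna(df[col].median())

# Value replacement: map a sentinel category for missing status codes.
df["status"] = df["status"].fillna("UNKNOWN")
print(df)
```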
6. Data Profiling
• Summary statistics describing the shape, size, and content of your data
• Helps you understand what is inside your data
• As a Data Engineer, you are responsible for providing proper data for analytics and models
• For big data sets, use sampling as a "good enough" approach
• View data statistics at any step in your data transformation
• Here is how to persist your data profile stats: https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-save-your-data-profiler-summary-stats-in-adf-data-flows/ba-p/1243251
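The linked blog post shows how to persist profiler stats inside ADF data flows; as a generic stand-in, a pandas sketch of the same idea (profile a sample of a big data set, compute summary statistics, and persist them for later comparison) might look like this. The synthetic table and the output file name are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical wide fact table; in practice this would be read from the lake.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "order_id": np.arange(100_000),
    "amount":   rng.normal(100, 25, 100_000).round(2),
    "region":   rng.choice(["east", "west", None], 100_000),
})

# For big data sets, profile a sample -- the "good enough" approach.
sample = df.sample(frac=0.05, random_state=1)

# Summary statistics describing the shape, size, and content of the data.
profile = pd.DataFrame({
    "dtype":    sample.dtypes.astype(str),
    "non_null": sample.notna().sum(),
    "null_pct": sample.isna().mean().round(4),
    "distinct": sample.nunique(),
})
numeric = sample.select_dtypes("number")
profile["mean"] = numeric.mean()
profile["stddev"] = numeric.std()

# Persist the stats so profiles can be compared across pipeline runs.
profile.to_csv("profile_stats.csv")
print(profile)
```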
7. Pattern Matching
With very wide datasets, or datasets where you cannot anticipate all column names, generate pattern-matching rules so that data quality logic can be applied en masse, without naming each column and describing each rule individually.
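A minimal sketch of the idea, assuming a pandas DataFrame rather than an ADF column pattern: apply one trimming rule to every column whose name matches a regex, so no column has to be named individually. The cust_ prefix convention here is hypothetical.

```python
import re
import pandas as pd

# Hypothetical wide dataset where column names cannot all be anticipated.
df = pd.DataFrame({
    "cust_name":    ["  Ana ", "Bo "],
    "cust_address": [" 12 Main St", "9 Elm Rd  "],
    "amount_usd":   [10.5, 20.0],
})

# Pattern-matching rule: trim every string column whose name starts with "cust_",
# instead of naming each column and describing each rule individually.
pattern = re.compile(r"^cust_")
for col in df.columns:
    if pattern.match(col) and df[col].dtype == object:
        df[col] = df[col].str.strip()

print(df)
```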
8. Enumerations / Lookups
• Replace coded values with categorical strings, or vice versa
• Hard-code rules inside a case() statement, or use a Join or Lookup method against a reference file or table
• Models and analytics will sometimes require codes or string literal values
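As a rough sketch of the lookup approach, assuming a small hypothetical reference table; in ADF this would be a Lookup or Join transformation, or a case() expression for hard-coded rules. The join direction can be reversed to turn strings back into codes.

```python
import pandas as pd

# Hypothetical fact rows carrying coded values.
facts = pd.DataFrame({"order_id": [1, 2, 3], "status_code": ["S", "C", "X"]})

# Reference file or table that maps codes to categorical strings.
ref = pd.DataFrame({"status_code": ["S", "C", "R"],
                    "status_name": ["Shipped", "Cancelled", "Returned"]})

# Lookup method: left join against the reference, keep the code as a fallback
# for values with no match (the equivalent of a case() default branch).
out = facts.merge(ref, on="status_code", how="left")
out["status_name"] = out["status_name"].fillna("Unknown (" + out["status_code"] + ")")
print(out)
```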
9. Data De-Duplication and Distinct Rows
• Use this pattern to eliminate duplicate rows from your data
• You pick a heuristic to use during duplicate matching
• You can tag rows and/or remove duplicate rows
• Use exact matching and/or fuzzy matching
• Available as the Dedupe Pipeline template
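A simplified pandas sketch of the tag-and/or-remove pattern, using a hypothetical case-insensitive email as the matching heuristic; the ADF Dedupe Pipeline template implements the same idea with data flow transformations.

```python
import pandas as pd

# Hypothetical customer feed with repeated records.
df = pd.DataFrame({
    "email": ["a@x.com", "A@X.COM", "b@y.com", "b@y.com"],
    "name":  ["Ana",      "Ana",     "Bo",      "Bo"],
    "loaded_at": pd.to_datetime(["2024-01-01", "2024-01-02",
                                 "2024-01-01", "2024-01-01"]),
})

# Heuristic used during duplicate matching: case-insensitive email.
df["match_key"] = df["email"].str.lower()

# Tag rows: mark every row after the first occurrence of a key as a duplicate...
df["is_duplicate"] = df.duplicated(subset="match_key", keep="first")

# ...and/or remove duplicate rows, keeping the most recently loaded record.
deduped = (df.sort_values("loaded_at")
             .drop_duplicates(subset="match_key", keep="last")
             .drop(columns=["match_key", "is_duplicate"]))
print(deduped)
```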
10. Fuzzy Lookups & Joins
• Sometimes when performing inline lookups, you don’t have exact matches when looking for references
• Fuzzy lookups with Soundex help find matches based on phonetic algorithms
• Very useful in data lake scenarios where joins and lookups are against data that is not normalized or cleaned
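As an illustration of the phonetic idea, here is a self-contained sketch with a basic American Soundex implementation and a join on the Soundex key. The customer names and reference table are invented, and production fuzzy matching would typically combine more heuristics than this.

```python
import pandas as pd

def soundex(name: str) -> str:
    """Classic American Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return "0000"
    out, prev = name[0], codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":          # H and W do not reset the previous code
            prev = code
    return (out + "000")[:4]

# Hypothetical lake data that is not normalized or cleaned...
raw = pd.DataFrame({"customer": ["Jon Smyth", "Claire Deluna"]})
# ...and a reference lookup table with the canonical spellings.
ref = pd.DataFrame({"customer": ["John Smith", "Clare De Luna"],
                    "customer_id": [101, 102]})

# Fuzzy lookup: join on the phonetic (Soundex) key instead of the exact string.
raw["key"] = raw["customer"].map(soundex)
ref["key"] = ref["customer"].map(soundex)
print(raw.merge(ref, on="key", how="left", suffixes=("_raw", "_ref")))
```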
11. Metadata Validation Rules
• Transform data conditionally based on metadata traits
• Create and manage metadata quality rules
• Manipulate column properties of source data
• https://www.youtube.com/watch?v=E_UD3R-VpYE
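A hedged sketch of metadata-driven validation outside of ADF: rules keyed on column traits (presence and data type) rather than fixed per-column logic. The expected schema dictionary here is a hypothetical stand-in for managed metadata quality rules.

```python
import pandas as pd

# Hypothetical source whose schema can drift between loads.
df = pd.DataFrame({
    "id": [1, 2], "city": [" Seattle", "Portland "], "revenue": ["10.5", "20.1"],
})

# Metadata quality rules keyed on column traits rather than fixed values.
expected = {"id": "int64", "city": "object", "revenue": "float64"}

# Validate the incoming metadata against the rules and transform conditionally.
for col, expected_type in expected.items():
    if col not in df.columns:
        raise ValueError(f"metadata rule failed: column '{col}' is missing")
    if str(df[col].dtype) != expected_type:
        # Conditional transform: coerce the column to the declared type.
        df[col] = df[col].astype(expected_type)

# Manipulate column properties based on a trait: trim every string-typed column.
for col in df.select_dtypes("object"):
    df[col] = df[col].str.strip()

print(df.dtypes, df, sep="\n")
```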
12. Assertions
• Useful for building data quality pipelines
• Log failed assertions
• Fail fast
• Built-in expectations:
  • Expect true
  • Expect unique
  • Expect exists
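A minimal sketch of the three built-in expectation types as plain Python checks, with failed assertions logged and a fail-fast raise at the end; the sample tables and rule names are hypothetical, and ADF's Assert transformation expresses the same checks declaratively.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dq")

df = pd.DataFrame({"order_id": [1, 2, 2],
                   "qty": [5, -1, 3],
                   "customer_id": [10, 11, 99]})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

failures = []

# Expect true: a row-level predicate must hold.
bad_qty = df[~(df["qty"] > 0)]
if not bad_qty.empty:
    failures.append(("expect_true: qty > 0", bad_qty))

# Expect unique: the business key must not repeat.
dupes = df[df.duplicated("order_id", keep=False)]
if not dupes.empty:
    failures.append(("expect_unique: order_id", dupes))

# Expect exists: every foreign key must exist in the reference table.
missing = df[~df["customer_id"].isin(customers["customer_id"])]
if not missing.empty:
    failures.append(("expect_exists: customer_id in customers", missing))

# Log failed assertions, then fail fast so bad data never reaches downstream loads.
for rule, rows in failures:
    log.error("assertion failed: %s (%d rows)", rule, len(rows))
if failures:
    raise AssertionError(f"{len(failures)} data quality assertions failed")
```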