Presentation on
Role of Data Cleaning in Data Warehouse
Ramakant Soni
Assistant Professor, BKBIET, Pilani
ramakant.soni@bkbiet.ac.in
Introduction
What is a Data Warehouse?
A data warehouse is an information delivery system in which data is integrated and
transformed into information used largely for strategic decision making. Historical
enterprise data from various operational systems is collected and combined with
other relevant data from outside sources to form the integrated content of the
data warehouse.
What is Data Cleaning?
Data cleaning, also called data cleansing or scrubbing, deals with detecting and
removing errors and inconsistencies from data in order to improve its quality.
RAMAKANT SONI, BKBIET
Steps to build a Data Warehouse: ETL Process
Figure 1. ETL Process
Need for Data Cleaning
• Data warehouses require and provide extensive support for data cleaning.
• They load and continuously refresh huge amounts of data from a variety of
sources, so the probability of “dirty data” is high.
• Data warehouses are used for decision making, so the correctness of data
is vital to avoid wrong conclusions.
Requirements
A data cleaning approach should satisfy several requirements:
• Detect and remove all major errors and inconsistencies both in individual
data sources and when integrating multiple sources. The approach should
be supported by tools to limit manual inspection and programming effort.
• Data cleaning should not be performed in isolation but together with
schema-related data transformations based on comprehensive metadata.
• Mapping functions should be specified in a declarative way for data
cleaning and be reusable for other data sources as well as for query
processing.
• A workflow infrastructure should be supported to execute all data
transformation steps for multiple sources and large data sets in a reliable
and efficient way.
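As a minimal sketch of the declarative-mapping requirement, the rules below are plain data (a column name paired with a cleaning function) that one generic routine applies to records from any source. The column names, the rule set, and the sample records are hypothetical and are not taken from the presentation.

```python
from typing import Callable, Dict, List

# Hypothetical declarative rules: column name -> cleaning function.
MAPPING_RULES: Dict[str, Callable[[str], str]] = {
    "name": lambda v: " ".join(v.split()).title(),               # collapse whitespace, normalize case
    "city": lambda v: v.strip().title(),                         # trim and normalize case
    "phone": lambda v: "".join(ch for ch in v if ch.isdigit()),  # keep digits only
}


def apply_rules(records: List[dict], rules: Dict[str, Callable[[str], str]]) -> List[dict]:
    """Apply every rule to the matching column; other columns pass through unchanged."""
    cleaned = []
    for record in records:
        row = dict(record)
        for column, rule in rules.items():
            if column in row and row[column] is not None:
                row[column] = rule(str(row[column]))
        cleaned.append(row)
    return cleaned


# The same rules are reused for two differently shaped (hypothetical) sources.
source_a = [{"name": "  kristen   SMITH ", "phone": "(333) 222-6542"}]
source_b = [{"name": "ravi kumar", "city": " pilani ", "id": 7}]

cleaned_a = apply_rules(source_a, MAPPING_RULES)
cleaned_b = apply_rules(source_b, MAPPING_RULES)
```

Because the rules are data rather than code scattered through a script, supporting another source only requires passing its records to `apply_rules`, possibly with an extended rule dictionary.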
Data Quality Problems
Single-source problems
The data quality of a source largely depends on the degree to which it is governed by
schema and integrity constraints controlling permissible data values.
• Sources without schema, such as files, have few restrictions on what data can be
entered and stored, giving rise to a high probability of errors and inconsistencies.
• Database systems enforce the restrictions of a specific data model (e.g., the relational
approach requires simple attribute values, referential integrity, etc.) as well as
application-specific integrity constraints.
Schema-level problems occur because of the lack of appropriate model-specific or
application-specific integrity constraints.
Instance-level problems relate to errors and inconsistencies that cannot be prevented
at the schema level (e.g., misspellings).
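The distinction can be illustrated with a small sketch. The first check covers problems a schema could have prevented (a unique key, a value range); the second is a crude heuristic for an instance-level problem (a probable misspelling). The records, the field names, and the 0 to 120 age range are assumptions made for illustration only.

```python
from collections import Counter
from typing import List

# Hypothetical records from a single source without enforced constraints.
records = [
    {"emp_id": 1, "name": "J. Smith", "age": 34, "city": "Pilani"},
    {"emp_id": 1, "name": "R. Kumar", "age": 29, "city": "Pilani"},    # duplicate key
    {"emp_id": 3, "name": "A. Mehta", "age": 210, "city": "Pilanii"},  # bad age, misspelled city
]


def schema_level_violations(rows: List[dict]) -> List[str]:
    """Checks that a schema could have enforced: key uniqueness and a value range."""
    problems = []
    id_counts = Counter(row["emp_id"] for row in rows)
    for emp_id, count in id_counts.items():
        if count > 1:
            problems.append(f"emp_id {emp_id} is not unique")
    for row in rows:
        if not 0 <= row["age"] <= 120:
            problems.append(f"emp_id {row['emp_id']} has age {row['age']} out of range")
    return problems


def instance_level_suspects(rows: List[dict], column: str) -> List[str]:
    """Flag values seen only once while another value repeats: a rough misspelling hint."""
    counts = Counter(row[column] for row in rows)
    most_common_count = counts.most_common(1)[0][1]
    return [value for value, count in counts.items() if count == 1 and most_common_count > 1]


schema_problems = schema_level_violations(records)
city_suspects = instance_level_suspects(records, "city")
```

In a relational database the first kind of check would normally be declared as a constraint, whereas the second kind still needs cleaning logic such as this after the data has been stored.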
Example: Single Source Problem
Multi-source problems
The problems in single sources are aggravated when multiple sources are integrated.
Each source may contain dirty data, and because the sources are independent, their
data may be represented differently, overlap, or contradict.
Result: a large degree of heterogeneity.
Problem in cleaning: identifying overlapping data, in particular matching records
that refer to the same real-world entity. This problem is also referred to as the
object identity problem or the duplicate elimination problem.
Frequently, the information is only partially redundant and the sources may
complement each other by providing additional information about an entity.
Solution: duplicate information should be purged, and complementing information
should be consolidated and merged in order to achieve a consistent view of
real-world entities.
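The matching and merging step can be sketched as follows. This is one simple approach under stated assumptions, not the method of any particular tool: records are treated as the same entity when their normalized names are at least 90% similar according to Python's `difflib.SequenceMatcher`, and gaps in one record are filled from the other. The sample records, field names, and threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional

# Hypothetical customer records from two independent sources.
source_a = [{"name": "Kristen Smith", "city": "Fort Lee", "phone": None}]
source_b = [
    {"name": "Kristen  Smith", "city": None, "phone": "333-222-6542"},
    {"name": "Christian Smith", "city": "Denver", "phone": "444-555-6666"},
]


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return " ".join(name.lower().split())


def is_match(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Treat two records as the same entity when their normalized names are similar enough."""
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold


def merge(a: dict, b: dict) -> dict:
    """Keep values from `a` and fill its gaps with complementary values from `b`."""
    return {key: a.get(key) if a.get(key) is not None else b.get(key) for key in {**a, **b}}


def integrate(left: List[dict], right: List[dict]) -> List[dict]:
    """Merge duplicates from `right` into `left`; append records that match nothing."""
    result = [dict(row) for row in left]
    for candidate in right:
        match: Optional[int] = next(
            (i for i, row in enumerate(result) if is_match(row, candidate)), None
        )
        if match is None:
            result.append(dict(candidate))                   # new entity
        else:
            result[match] = merge(result[match], candidate)  # duplicate: consolidate
    return result


integrated = integrate(source_a, source_b)
```

Here the two "Kristen Smith" records collapse into one that carries both the city and the phone number, while "Christian Smith" stays a separate entity. Real matching usually compares several attributes and needs a threshold tuned to the data.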
Example: Multi-Source Problem
Figure 2. Multi-Source problem example
Data cleaning phases
In general, data cleaning involves several phases:
• Data analysis
• Definition of transformation workflow and mapping rules
• Verification
• Transformation
• Backflow of cleaned data
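The phases listed above can be walked through on a toy example. This is only a sketch: the source table, the gender codes, and the mapping are hypothetical, and a real workflow would run against the operational sources rather than in-memory lists.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical source table with inconsistent gender codes.
source = [{"id": 1, "gender": "M"}, {"id": 2, "gender": "male"},
          {"id": 3, "gender": "F"}, {"id": 4, "gender": "0"}]

# 1. Data analysis: profile the distinct values of a column.
profile = Counter(row["gender"] for row in source)

# 2. Definition of the transformation workflow and mapping rules.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}
rules: Dict[str, Callable[[str], str]] = {
    "gender": lambda v: GENDER_MAP.get(v.strip().lower(), "UNKNOWN"),
}


def transform(rows: List[dict]) -> List[dict]:
    """Return new rows with every rule applied to its column."""
    return [{k: rules[k](v) if k in rules else v for k, v in row.items()} for row in rows]


# 3. Verification: try the rules on a sample and check the result before the full run.
sample = transform(source[:2])
assert all(row["gender"] in {"M", "F", "UNKNOWN"} for row in sample)

# 4. Transformation: run the workflow over the whole source.
cleaned = transform(source)

# 5. Backflow: write the cleaned values back to the source, keyed by id.
cleaned_by_id = {row["id"]: row for row in cleaned}
for row in source:
    row.update(cleaned_by_id[row["id"]])
```

Values the rules cannot interpret are mapped to `UNKNOWN` rather than guessed, so they remain visible for a later round of analysis.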
Data cleaning process
1. Data analysis and definition of the transformation workflow and mapping rules
2. Verification and transformation
3. Backflow of cleaned data
Figure 3. Data Cleaning Process
Data cleaning tool support
A large variety of tools is available to support data transformation and data cleaning:
• Data analysis tools
1. Data profiling tools, e.g., MigrationArchitect (Evoke Software)
2. Data mining tools, e.g., WizRule (WizSoft)
• Data reengineering tools use discovered patterns and rules for cleaning,
e.g., Integrity (Vality Software)
• Specialized cleaning tools deal with a particular domain
1. Special domain cleaning, e.g., IDCentric (FirstLogic)
2. Duplicate elimination, e.g., MatchIt (HelpItSystems)
• ETL tools use a repository built on a DBMS to manage all metadata about data sources,
target schemas, mapping scripts, etc. in a uniform way,
e.g., Extract (ETI), CopyManager (InformationBuilders)
Thank You