Target Corporation




   BI Framework
   Error Processing
   Mohan.Kumar2
Table of Contents

1.     Exception Handling Overview (ref 2.5.2) .... 3
     1.1.     Data Reprocessing .... 5
     1.2.     Infrastructure Exception Handling .... 7
     1.3.     Data Correction in DWH .... 9
2.     Error Processing – High Level .... 11
     2.1.     Capturing .... 11
     2.2.     Error threshold .... 11
     2.3.     Purging .... 12
        2.3.1.        Landing Area .... 12
        2.3.2.        Staging Area .... 12
        2.3.3.        EDW .... 12
        2.3.4.        Datamart .... 12
     2.4.     Purge threshold .... 12
     2.5.     Appendix .... 12
        2.5.1.        About Target .... 12
        2.5.2.        Reference .... 13
        2.5.3.        Other Contributors .... 13




                                                                                                                                              Page 2 of 13
1. Exception Handling Overview (ref 2.5.2)




Exception Handling deals with any abnormal termination, unacceptable event or incorrect data that
can impact the data flow or accuracy of data in the warehouse/mart.

Exceptions in ETL could be classified as Data Related Exceptions and Infrastructure Related
Exceptions.


Please note: temporary infrastructure glitches are not classified as exceptions, because they are
usually resolved by the time the job(s) are rerun. Their logs are still tracked and maintained.

The process of recovering or gracefully exiting when an exception occurs is called exception handling.




Data related exceptions are caused by incorrect data formats, incorrect values, or incomplete data
from the source system. These lead to data validation exceptions and data rejects. The process of
handling the data rejects is called Data Reprocessing.
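The validation step that produces these rejects can be sketched as follows. This is a minimal illustration, not the actual framework's code; the field names and rules are assumptions.

```python
# Split source records into valid rows and rejects during data validation.
# Field names and validation rules are illustrative only.

REQUIRED_FIELDS = ("item_id", "store_id", "sale_amount")

def validate(record):
    """Return None if the record is valid, else the rejection reason."""
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            return f"missing value for {field}"
    try:
        float(record["sale_amount"])
    except (TypeError, ValueError):
        return "sale_amount is not numeric"
    return None

def split_records(records):
    """Route each record to the load set or the reject set with a reason."""
    valid, rejects = [], []
    for rec in records:
        reason = validate(rec)
        if reason is None:
            valid.append(rec)
        else:
            rejects.append({**rec, "reject_reason": reason})
    return valid, rejects
```

The reject set, tagged with its reasons, is what later feeds the reprocessing flow described below.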



Infrastructure related exceptions are caused by issues in the network, the database, or the
operating system. Common infrastructure exceptions are FTP failures, database connectivity failures,
a full file system, etc.

Data related exceptions are usually documented in the requirements; if they are not, they must be,
because unhandled data related exceptions lead to inaccurate data in the warehouse/mart. We also
keep a threshold on the maximum number of validation or reject failures allowed per load. Exceeding
the threshold would mean the data is too inaccurate because of too many rejections.

There is one more class of exception: the presence of inaccurate or incorrect data in the
warehouse. This can happen due to

    1.   Incorrect or missed requirements, leading to incorrect ETL.
    2.   Incorrect interpretation of requirements, leading to incorrect ETL.
    3.   Uncaught coding defects.
    4.   Incorrect data from the source.

Correcting data already loaded in the warehouse involves both fixing the loaded data and
preventing the inaccuracy from persisting in the future.




    1.1.         Data Reprocessing

Reprocessing is an exception handling process that involves correcting the data that could not be
loaded into the warehouse/mart.

There are many reasons why source data gets rejected from the DWH. The most common are:

Data Rejection - Source data not matching critical business codes/attributes. This is called a
Lookup Failure in ETL.

Data Cleansing - Source data containing junk values for business critical fields, and hence getting
rejected during data validation.

There are a couple of ways to deal with the rejected records: we could leave the rejected data out
of the DWH, or we could correct it, based on whether the rejected field is critical to the business
and worth reprocessing, and then load it into the DWH. The process of correcting the rejected data
and then loading it into the DWH is called Data Reprocessing.
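A lookup failure, the first rejection cause named above, can be illustrated with a few lines of code. The reference codes and record shape here are made up for the sketch.

```python
# Illustrative lookup-failure check: a source record is rejected when its
# business code is absent from the reference (lookup) set. Codes are made up.

VALID_DEPT_CODES = {"D01", "D02", "D03"}  # stand-in for a reference table

def passes_lookup(record):
    """Return True when the record's dept_code matches a reference code."""
    return record.get("dept_code") in VALID_DEPT_CODES

# Records failing the lookup are routed to the reject set
rejects = [r for r in [{"dept_code": "D01"}, {"dept_code": "D99"}]
           if not passes_lookup(r)]
```

In a real ETL tool the reference set would be a lookup against a dimension or code table rather than an in-memory set.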




As depicted in the figure above, data is rejected during the data validation, data cleansing, and
data transformation processes. The rejected data is collected in temporary files on the ETL server
while the ETL is running. Once the ETL is complete, the rejected data is moved into the Landing
Area.

The end user and the business analyst are provided interfaces to read the rejected data in the
landing area. They take this as input, analyze the cause of rejection, and correct the data at the
source itself. Once the data is corrected at the source, it is extracted again (depicted by the
brown line in the figure). The corrected data is not expected to be rejected again unless the
correction provided was insufficient.




In some business critical data warehouses with very low tolerance for inaccurate data, we need a
sophisticated and fast mechanism for handling rejected data in the landing area. Here we consider a
database to land the data. The database schema is the same as that of the source files/tables, with
two additional columns: one to flag whether the record was rejected in ETL, and the other to
identify the date the data was sent by the source system. Having a database gives us the option of
easily creating applications to access and update the data in the landing area.

Please note that adding a database in the landing area adds infrastructure and maintenance costs.
It also increases the number of processes in the extraction flow, thereby affecting the performance
of the ETL.
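The two extra columns described above can be sketched with an in-memory SQLite table standing in for the landing database. The table and column names are assumptions for illustration.

```python
import sqlite3

# Landing-area table mirroring a hypothetical source schema, plus the two
# extra columns described above: a reject flag and the source-send date.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE landing_sales (
        item_id      TEXT,
        store_id     TEXT,
        sale_amount  REAL,
        is_rejected  INTEGER DEFAULT 0,  -- 1 when the record failed ETL
        source_date  TEXT                -- date the source sent the data
    )
""")

# Land a record as-is from the source
conn.execute(
    "INSERT INTO landing_sales (item_id, store_id, sale_amount, source_date) "
    "VALUES (?, ?, ?, ?)",
    ("I100", "S42", 9.99, "2011-06-01"),
)

# Flag the record after it is rejected during ETL
conn.execute("UPDATE landing_sales SET is_rejected = 1 WHERE item_id = 'I100'")
```

An analyst-facing application would query on `is_rejected` and `source_date` to find and correct the rejected rows.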




   1.2.        Infrastructure Exception Handling


Infrastructure related exceptions are caused because of issues in the Network connectivity, the
Database operations and the Operating System.

Common Infrastructure exceptions are




Database errors such as connection errors, referential integrity constraint failures, primary key
constraint failures, incorrect credentials, data type mismatches, and nulls in NOT NULL fields.

Network connection failures causing FTP failures.

Operating system issues on the ETL server, such as aborts due to insufficient memory, unmounted
file systems, 100% CPU utilization, and incorrect file/directory permissions.

The diagram below depicts these exceptions and the process to handle them.

The above exceptions are generally detected by the ETL scheduler, which checks whether the ETL
process returned a non-zero value.

If an exception occurs, we make a log entry, send email or alerts notifying the users that the ETL
process has aborted, and exit to the operating system with a non-zero value.

The notification process alerts the IS team to take appropriate action so that the ETL process can
be restarted once the infrastructure issue is resolved.
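The detect/log/notify/exit loop above can be sketched as a scheduler-style wrapper around an ETL step. The command and the notification hook are placeholders, not the actual scheduler's interface.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def notify(message):
    # Stand-in for the mail/alert mechanism described above
    log.error("ALERT: %s", message)

def run_etl_step(command):
    """Run one ETL step; on a non-zero exit, log and notify, then
    propagate the non-zero status back to the caller."""
    result = subprocess.run(command)
    if result.returncode != 0:
        log.error("ETL step failed with exit code %d", result.returncode)
        notify("ETL aborted: " + " ".join(command))
    return result.returncode

# Simulate an ETL step aborting with a non-zero exit code
rc = run_etl_step([sys.executable, "-c", "raise SystemExit(3)"])
```

Once the IS team resolves the infrastructure issue, rerunning the same command restarts the step.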




1.3.   Data Correction in DWH

The data in the DWH could be incorrect or inaccurate due to a variety of reasons, mainly

    1.   Incorrect or missed requirements, leading to incorrect ETL.
    2.   Incorrect interpretation of requirements, leading to incorrect ETL.
    3.   Uncaught coding defects.
    4.   Incorrect data from the source.

Reasons 1, 2, and 3 require us to revisit the ETL code with respect to the incorrect requirements,
missed requirements, and uncaught defects.

The figure below depicts the process to be followed to correct data already loaded in the DWH.

Detection

The most important step is detecting the inaccurate or incorrect data in the DWH. Incorrect data
loaded into the DWH is usually detected long after it has been loaded, when some end user
identifies it in a report.

Analysis

Once reported, we analyze the report and its metadata. This requires understanding the report
metadata, the calculations, and the SQL generated by the report.

If there is no issue in the report definition, we analyze the data in the DWH. Once we have
pinpointed the table, attributes, and data where the inaccuracy lies, we perform root cause
analysis.

Root cause analysis requires checking the data against the requirements, design, and code, and it
helps us identify the next course of action.

Missing Requirements - If the root cause is missing requirements, we go to the users and get the
complete requirements.

Misinterpretation of Requirements - Here too we go to the end user and clarify the misinterpreted
requirement.

Defect in the code - There is a possibility of bugs going undetected during the testing phase. If
undetected, a bug can cause inaccuracy in the data.

Correction Process

In case of missing requirements,

    1.    Get the new requirements from the users.
    2.    Document the new requirements.
    3.    Design the new ETL.
    4.    Code the new ETL.
    5.    Test the new ETL.
    6.    Take the DWH offline.
    7.    Perform the history load for the new requirements. This is possible only when we have
          added new tables or new attributes to the data model.
    8.    Check the report for new requirements.
    9.    If the reports are correct, then implement the new ETL into the regular ETL.
    10.   Perform the catch-up load for the duration the DWH was offline.
    11.   Bring the DWH online.

In case of misinterpreted requirements or undetected bugs,

    1.    Analyze the ETL and identify the changes in it.
    2.    Update the design.
    3.    Correct the code.
    4.    Test the code.
    5.    Create a patch to update the historical data (data already in DWH) to correct it.
    6.    Test the patch.
    7.    Take the DWH offline.
    8.    Run the patch.
    9.    Check the report for correction.
    10.   If the reports are correct, then implement the corrected ETL.
    11.   Perform the catch-up load for the duration the DWH was offline.
    12.   Bring the DWH online.




2. Error Processing – High Level




Target's error processing follows a uniform framework, described at a high level below.

   2.1. Capturing
       Data from all the source systems is dumped into the landing area as-is. All records in the
       landing area are initially marked as valid during the load.

       On a given schedule, the records are processed from the landing area to the staging area,
       and all the business validations are executed on these records. Once the staging load is
       finished, all records which were not loaded into the staging area are marked as invalid in
       the landing area.

       Information on each rejected record is stored in the error tables along with an error code.
       A separate reference table maps each error code to its description.

       Depending on the table(s), there can be multiple business validations for each record, so a
       given source record can end up with multiple entries in the error table(s).

       Records marked as invalid are processed again in every staging load until they are purged
       or a corrected record is sent from the source.
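The capture flow above can be sketched with in-memory SQLite tables standing in for the landing and error tables. All table and column names here are assumptions for illustration.

```python
import sqlite3

# Landing table (records arrive marked valid), error log keyed by error
# code, and the reference table mapping codes to descriptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE landing (rec_id INTEGER, payload TEXT,
                          is_valid INTEGER DEFAULT 1);
    CREATE TABLE error_log (rec_id INTEGER, error_code TEXT);
    CREATE TABLE error_ref (error_code TEXT, description TEXT);
""")
conn.execute("INSERT INTO error_ref VALUES ('E01', 'lookup failure')")
conn.executemany("INSERT INTO landing (rec_id, payload) VALUES (?, ?)",
                 [(1, "good row"), (2, "bad row")])

# Suppose record 2 fails a business validation during the staging load:
failures = [(2, "E01")]  # one row per failed validation, with its code
conn.executemany("INSERT INTO error_log VALUES (?, ?)", failures)
conn.executemany("UPDATE landing SET is_valid = 0 WHERE rec_id = ?",
                 [(rid,) for rid, _ in failures])
```

A record failing several validations would simply add several `error_log` rows, matching the multiple-entries behavior described above.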

   2.2. Error threshold
       If the number of rejections reaches a given threshold limit, a mail is sent to the EAM /
       Business data quality team reporting the abnormal behavior, and the job is aborted.


Based on the feedback, the jobs are rerun/re-triggered manually.
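The notify-and-abort rule above can be sketched as a small check invoked after each load. The threshold value and the notify hook are illustrative, not the framework's actual configuration.

```python
# Error-threshold rule: when the rejection count for a load reaches the
# configured limit, notify the data quality team and abort the job.

REJECT_THRESHOLD = 1000  # assumed maximum rejections allowed per load

class ThresholdExceeded(Exception):
    """Raised to abort the job when rejections reach the threshold."""

def check_error_threshold(reject_count, notify):
    """Notify and abort when the rejection count hits the threshold."""
    if reject_count >= REJECT_THRESHOLD:
        notify(f"{reject_count} rejections reached the threshold "
               f"({REJECT_THRESHOLD}); aborting job")
        raise ThresholdExceeded(reject_count)
```

After the team reviews the rejections, the load is rerun manually, as described above.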



2.3. Purging
    Purging deletes previous records which are no longer required by a given business process.

    The purging logic applied to each area is as follows:

    2.3.1. Landing Area
           1. Valid records - Only the previous 7 days of valid records (those loaded into the
              Staging area) are retained. The rest are purged.

           2. Invalid records - Invalid records which errored out of the Staging area are
              retained for 30 days. The rest are purged.
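The 7-day/30-day retention rule can be sketched as a date comparison per landing record. The record shape here is an assumption for illustration.

```python
from datetime import date

# Landing-area purge rule: keep valid records for 7 days and invalid
# (errored-out) records for 30 days; purge anything older.

def should_purge(record, today):
    """True when the record has outlived its retention window."""
    age_days = (today - record["source_date"]).days
    retention = 7 if record["is_valid"] else 30
    return age_days > retention

today = date(2011, 6, 30)
records = [
    {"is_valid": True,  "source_date": date(2011, 6, 25)},  # 5 days: keep
    {"is_valid": True,  "source_date": date(2011, 6, 10)},  # 20 days: purge
    {"is_valid": False, "source_date": date(2011, 6, 10)},  # 20 days: keep
]
kept = [r for r in records if not should_purge(r, today)]
```

In the database-backed landing area, the same rule would be a DELETE filtered on the reject flag and the source-date column.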

    2.3.2. Staging Area
           Truncate and load. An area where we load and verify the data is good before making
           any changes to the warehouse tables.

   2.3.3. EDW
          Depending on Business need, data is maintained in EDW.

    2.3.4. Datamart
           Depending on business need, data is maintained in the datamart.

2.4. Purge threshold
    During purging, the business can set a threshold limit on the number of records being purged.
    If the threshold limit is crossed while deleting, the purge jobs are automatically aborted and
    a mail is sent to the EAM / Business data quality team for confirmation.

    Once the business confirms, the aborted jobs are triggered manually.
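The purge-threshold safeguard can be sketched as counting the candidate rows before deleting. The limit and the delete/notify hooks are illustrative assumptions.

```python
# Purge-threshold safeguard: count how many rows the purge would delete,
# and abort with a confirmation mail when the count crosses the limit.

PURGE_LIMIT = 10000  # assumed business-set threshold

def run_purge(candidate_count, delete, notify):
    """Delete only when under the limit; otherwise abort and notify."""
    if candidate_count > PURGE_LIMIT:
        notify(f"purge of {candidate_count} records exceeds the limit of "
               f"{PURGE_LIMIT}; awaiting business confirmation")
        return False  # aborted; rerun manually once the business confirms
    delete()
    return True
```

Counting first, rather than deleting and rolling back, keeps an accidental mass purge from ever touching the data.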



2.5. Appendix

   2.5.1. About Target
          TBU

2.5.2. Reference
     The Exception Handling Overview is an extract from www.dwhinfo.com written
     by Krishan.Vinayak@target.com

2.5.3. Other Contributors


           Krishan.Vinayak – Delivery Manager

           Devanathan.Rajagopalan – Senior Technical Architect

           Asis.Mohanty – BI Manager

           Joseph.Raj – Technical Architect





 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 

Dernier (20)

The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 

BI Error Processing Framework

  • 1. Target Corporation BI Framework Error Processing Mohan.Kumar2
  • 2. Table of Contents
       1. Exception Handling Overview (ref 2.5.2) .................... 3
          1.1. Data Reprocessing ..................................... 5
          1.2. Infrastructure Exception Handling ..................... 7
          1.3. Data Correction in DWH ................................ 9
       2. Error Processing – High Level .............................. 11
          2.1. Capturing ............................................. 11
          2.2. Error threshold ....................................... 11
          2.3. Purging ............................................... 12
               2.3.1. Landing Area ................................... 12
               2.3.2. Staging Area ................................... 12
               2.3.3. EDW ............................................ 12
               2.3.4. Datamart ....................................... 12
          2.4. Purge threshold ....................................... 12
          2.5. Appendix .............................................. 12
               2.5.1. About Target ................................... 12
               2.5.2. Reference ...................................... 13
               2.5.3. Other Contributors ............................. 13
       Page 2 of 13
  • 3. 1. Exception Handling Overview (ref 2.5.2)

       Exception handling deals with any abnormal termination, unacceptable event, or incorrect data that can impact the data flow or the accuracy of data in the warehouse/mart. Exceptions in ETL can be classified as data-related exceptions and infrastructure-related exceptions.

       Please note: transient infrastructure glitches are not classified as exceptions, since they are temporary and usually resolved by the time the job(s) are rerun; their logs are, however, tracked and maintained.

       The process of recovering, or exiting gracefully, when an exception occurs is called exception handling.
  • 4. Data-related exceptions are caused by incorrect data formats, incorrect values, or incomplete data from the source system. These lead to data validation exceptions and data rejects. The process of handling the data rejects is called Data Reprocessing.
  • 5. Infrastructure-related exceptions are caused by issues in the network, the database, or the operating system. Common infrastructure exceptions are FTP failures, database connectivity failures, a full file system, etc.

       Data-related exceptions are usually documented in the requirements; if they are not, they must be, because unhandled data-related exceptions lead to inaccurate data in the warehouse/mart. We also keep a threshold on the maximum number of validation or reject failures allowed per load; exceeding it would mean the data is too inaccurate because of too many rejections.

       There is one more kind of exception: the presence of inaccurate or incorrect data in the warehouse. This can happen due to:
       1. Incorrect or missed requirements, leading to incorrect ETL.
       2. Incorrect interpretation of requirements, leading to incorrect ETL.
       3. Uncaught coding defects.
       4. Incorrect data from the source.
       Correcting data already loaded in the warehouse involves fixing the loaded data and also preventing the inaccuracy from persisting in the future.

       1.1. Data Reprocessing

       Reprocessing is an exception handling process that corrects data that could not be loaded into the warehouse/mart. Source data can be rejected from the DWH for many reasons; the most common are:
       Data Rejection – source data not matching critical business codes/attributes. This is called a Lookup Failure in ETL.
       Data Cleansing – source data containing junk values for business-critical fields, and hence rejected during data validation.
       Rejected records can either be left out of the DWH or, when the rejected field is critical to the business and worth reprocessing, corrected and then loaded into the DWH. The process of correcting the rejected data and then loading it into the DWH is called Data Reprocessing.
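The two reject categories above, lookup failures and data-cleansing rejects, can be sketched as follows. This is a hypothetical, minimal Python example: the lookup set, field names, and error codes are illustrative assumptions, not part of the framework itself.

```python
# Hypothetical sketch of reject capture during data validation.
# VALID_DEPT_CODES stands in for a business-code lookup table; the
# field names and error codes are illustrative assumptions.
VALID_DEPT_CODES = {"D01", "D02", "D03"}

def validate_record(rec):
    """Return (is_valid, error_code) for one source record."""
    # Data Rejection: source code not found in the business lookup.
    if rec.get("dept_code") not in VALID_DEPT_CODES:
        return False, "ERR_LOOKUP_DEPT"
    # Data Cleansing: junk value in a business-critical numeric field.
    if not str(rec.get("amount", "")).replace(".", "", 1).isdigit():
        return False, "ERR_JUNK_AMOUNT"
    return True, None

def split_rejects(records):
    """Split a batch into accepted records and tagged rejects."""
    accepted, rejects = [], []
    for rec in records:
        ok, code = validate_record(rec)
        if ok:
            accepted.append(rec)
        else:
            rejects.append({**rec, "error_code": code})
    return accepted, rejects
```

The tagged rejects would then be written to the reject files on the ETL server and, after the run, moved to the landing area for analysis.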
  • 6. As depicted in the figure above, data is rejected during the data validation, data cleansing, and data transformation processes. The rejected data is collected in temporary files on the ETL server while the ETL is running; once the ETL is complete, it is moved into the Landing Area. The end user and the business analyst are provided interfaces to read the reject data in the landing area. They take this as input, analyze the cause of rejection, and correct the data at the source itself. Once the data is corrected at the source, it is extracted again (depicted by the brown line in the figure). The corrected data is not expected to be rejected again unless the correction was insufficient.
  • 7. Some business-critical data warehouses have very low tolerance for inaccurate data and need a sophisticated, fast mechanism for handling rejected data in the landing area. Here we consider a database to land the data. The database schema is the same as that of the source files/tables, with two additional columns: one to flag whether the record was rejected in ETL, and one to identify the date the data was sent by the source system. Having a database makes it easy to build applications that access and update the data in the landing area. Note that adding a database to the landing area adds infrastructure and maintenance costs, and it also increases the number of processes in the extraction flow, affecting ETL performance.

       1.2. Infrastructure Exception Handling

       Infrastructure-related exceptions are caused by issues in network connectivity, database operations, and the operating system. Common infrastructure exceptions are:
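The landing-area table described above, the source schema plus a reject flag and a source date, might look like the following sketch. It uses sqlite3 purely for illustration; the table and column names are assumptions, and a production landing area would sit on the warehouse RDBMS.

```python
import sqlite3

# Hypothetical landing-area table: mirrors the source layout plus the
# two control columns described above. All names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE landing_orders (
        order_id    INTEGER,
        dept_code   TEXT,
        amount      TEXT,
        is_rejected INTEGER DEFAULT 0,  -- set when the record fails ETL validation
        source_date TEXT                -- date the source system sent the record
    )
""")

def land(rec, src_date):
    """Insert a source record into the landing table, initially valid."""
    conn.execute(
        "INSERT INTO landing_orders (order_id, dept_code, amount, source_date) "
        "VALUES (?, ?, ?, ?)",
        (rec["order_id"], rec["dept_code"], rec["amount"], src_date),
    )

def flag_reject(order_id):
    """Mark a landed record as rejected by the ETL."""
    conn.execute(
        "UPDATE landing_orders SET is_rejected = 1 WHERE order_id = ?",
        (order_id,),
    )
```

The reprocessing interfaces for the business analyst would then be simple queries and updates against this table.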
  • 8. Database errors, such as connection errors, referential integrity constraint failures, primary key constraint failures, incorrect credentials, data type mismatches, and NULLs in NOT NULL fields.
       Network connection failures, causing FTP failures.
       Operating system issues on the ETL server, such as aborts due to insufficient memory, unmounted file systems, 100% CPU utilization, and incorrect file/directory permissions.

       The diagram below depicts these exceptions and the process for handling them. Detection of the above exceptions is generally done by the ETL scheduler, which checks whether the ETL process returned a non-zero value. If an exception occurs, we make a log entry, send email or alerts notifying users that the ETL process has aborted, and exit to the operating system with a non-zero value. The notification alerts the IS team to take appropriate action so that the ETL process can be restarted once the infrastructure issue is resolved.
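The detect / log / notify / exit pattern above can be sketched as a hypothetical job wrapper. The logger name, the notification stub, and the exit code are illustrative assumptions, not the framework's actual implementation.

```python
import logging
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def notify(message):
    # Placeholder for the email/alert hook that reaches the IS team.
    log.error("ALERT: %s", message)

def run_etl_step(step):
    """Run one ETL step; on any infrastructure exception, log it,
    notify, and exit with a non-zero status so that the scheduler
    detects the abort."""
    try:
        step()
    except Exception as exc:  # FTP failure, DB connectivity, full file system, ...
        log.exception("ETL step aborted")
        notify(f"ETL process aborted: {exc}")
        sys.exit(1)
```

Because the wrapper exits non-zero, any scheduler that inspects the process return code will flag the job as failed and hold downstream jobs until the issue is resolved.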
  • 9. 1.3. Data Correction in DWH

       The data in the DWH could be incorrect or inaccurate for a variety of reasons, mainly:
       1. Incorrect or missed requirements, leading to incorrect ETL.
       2. Incorrect interpretation of requirements, leading to incorrect ETL.
       3. Uncaught coding defects.
       4. Incorrect data from the source.
       Reasons 1, 2, and 3 require us to revisit the ETL code with respect to the incorrect requirements, missed requirements, and uncaught defects. The figure below depicts the process to be followed to correct data already loaded in the DWH.

       Detection. Most important is the detection of inaccurate or incorrect data in the DWH. Incorrect data loaded in the DWH is usually detected long after it has been loaded, when an end user identifies it in a report.

       Analysis. Once reported, we analyze the report and its metadata. This requires understanding the report metadata, the calculations, and the SQL generated by the report.
  • 10. If there is no issue in the report definition, we analyze the data in the DWH. Once we have pinpointed the table, the attributes, and the data in the DWH where the inaccuracy lies, we perform a root cause analysis. This requires checking the data against the requirements, the design, and the code. The root cause identifies the next course of action:
       Missing requirements – we go to the users and get the complete requirements.
       Misinterpretation of requirements – here too we go to the end user and clarify the misinterpreted requirement.
       Defect in the code – bugs can go undetected during the testing phase; if undetected, a bug can cause inaccuracy in the data.

       Correction Process

       In case of missing requirements:
       1. Get the new requirements from the users.
       2. Document the new requirements.
       3. Design the new ETL.
       4. Code the new ETL.
       5. Test the new ETL.
       6. Take the DWH offline.
       7. Perform the history load for the new requirements. This is possible only when we have added new tables or new attributes to the data model.
       8. Check the report for the new requirements.
       9. If the reports are correct, implement the new ETL into the regular ETL.
       10. Perform the catch-up load for the duration the DWH was offline.
       11. Bring the DWH online.

       In case of misinterpreted requirements or undetected bugs:
       1. Analyze the ETL and identify the changes in it.
       2. Update the design.
       3. Correct the code.
       4. Test the code.
       5. Create a patch to correct the historical data (data already in the DWH).
       6. Test the patch.
       7. Take the DWH offline.
       8. Run the patch.
       9. Check the report for the correction.
       10. If the reports are correct, implement the corrected ETL.
       11. Perform the catch-up load for the duration the DWH was offline.
       12. Bring the DWH online.
  • 11. 2. Error Processing – High Level

       Error processing in Target follows a single, consistent process across all layers.

       2.1. Capturing

       Data from the various source systems is loaded into the landing area as-is. All records in the landing area are initially marked as valid during the load. On a given schedule, the records are processed from the landing area into the staging area, and all business validations are executed on them. Once the staging load is finished, all records that were not loaded into the staging area are marked as invalid in the landing area. Information about all the rejected records is stored in the error tables with an error code; a separate table holds the reference data for each error code. Depending on the table(s), there can be multiple business validations for each record, so a given source record can end up with multiple entries in the error table(s). Records marked as invalid are reprocessed on every staging load until they are purged or a corrected record is sent from the source.

       2.2. Error threshold

       If the number of rejections reaches a given threshold limit, a mail is sent to the EAM / Business data quality team reporting the abnormal behavior, and the job is aborted.
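The threshold check described above might be sketched as follows. The percentage-based limit and the alert callback are assumptions for illustration; the framework may equally well count absolute rejections.

```python
def check_error_threshold(total, rejected, threshold_pct, alert):
    """Abort the load when the reject rate reaches the threshold:
    call the alert hook (standing in for the mail to the EAM /
    Business data quality team), then raise to abort the job."""
    if total and (rejected / total) * 100 >= threshold_pct:
        alert(f"Reject rate {rejected}/{total} reached the {threshold_pct}% threshold")
        raise RuntimeError("load aborted: error threshold reached")
```

The staging load would call this check after validation; the raised error stops the job so that it can be rerun manually once the data quality team responds.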
  • 12. Based on the feedback, the jobs are rerun/re-triggered manually.

       2.3. Purging

       Purging deletes older records that are no longer required by a given business process. The purge logic is as follows:

       2.3.1. Landing Area
       1. Valid records – valid records that have been loaded into the staging area are retained for the previous 7 days only; the rest are purged.
       2. Invalid records – invalid records that were errored out of the staging area are retained for 30 days; the rest are purged.

       2.3.2. Staging Area
       Truncate and load. An area where we load and verify that the data is good before making any changes to the warehouse tables.

       2.3.3. EDW
       Data is retained in the EDW depending on business need.

       2.3.4. Datamart
       Data is retained in the datamart depending on business need.

       2.4. Purge threshold

       The business can set a threshold limit on the number of records being purged. If the threshold limit is crossed while deleting, the purge jobs are automatically aborted and a mail is sent to the EAM / Business data quality team for confirmation. Once the business confirms, the aborted jobs are triggered manually.

       2.5. Appendix

       2.5.1. About Target
       TBU
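The landing-area retention rules (7 days for valid records, 30 days for invalid ones) can be sketched as follows. Treating the boundary as strictly less than the limit is an assumption, as is the record layout.

```python
from datetime import date

def purge_landing(records, today):
    """Apply the landing-area retention rules: keep valid records for
    7 days and invalid (rejected) records for 30 days; purge the rest."""
    kept = []
    for rec in records:
        age_days = (today - rec["load_date"]).days
        # Rejected records get the longer retention window so that the
        # business has time to analyze and correct them at the source.
        limit = 30 if rec["is_rejected"] else 7
        if age_days < limit:
            kept.append(rec)
    return kept
```

A purge-threshold guard, like the error-threshold check in section 2.2, would count the records dropped here and abort the purge job if the count exceeds the business-configured limit.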
  • 13. 2.5.2. Reference
       The Exception Handling Overview is an extract from www.dwhinfo.com, written by Krishan.Vinayak@target.com.

       2.5.3. Other Contributors
       Krishan.Vinayak – Delivery Manager
       Devanathan.Rajagopalan – Senior Technical Architect
       Asis.Mohanty – BI Manager
       Joseph.Raj – Technical Architect