@joe_Caserta#DGIQ2015
The Fundamentals of Data Quality
Understanding, Planning and Achieving
Data Quality in Your Organization
Joe Caserta
Joe Caserta Timeline
Years marked on the timeline: 1986, 1996, 2001, 2004, 2009, 2012, 2013, 2014
• Began consulting in database programming and data modeling
• 25+ years hands-on experience building database solutions
• Data Analysis, Data Warehousing and Business Intelligence since 1996
• Founded Caserta Concepts in NYC
• Web log analytics solution published in Intelligent Enterprise
• Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit
• Launched Big Data practice
• Launched Big Data Warehousing (BDW) Meetup NYC: 2,000+ Members
• Launched Data Science, Data Interaction and Cloud practices
• Laser focus on extending Data Analytics with Big Data solutions
• Established best practices for big data ecosystem implementations
• Dedicated to Data Governance Techniques on Big Data (Innovation)
• Top 20 Big Data Consulting - CIO Review
• Top 20 Most Powerful Big Data consulting firms
• 2015: Awarded for getting data out of SAP for data analytics
Data Quality
• Foremost reason for data warehouse failure is lack of data accuracy
• Accurate data means:
– Correct
– Unambiguous
– Consistent
– Complete
• Every Data Management system needs a data quality sub-system to some degree
The Data Quality Pipeline
Extract → Clean → Conform → Deliver
– After Extract: extracted data staged to disk
– After Clean: clean data staged to disk
– After Conform: conformed data staged to disk
– After Deliver: cleansed data ready for delivery
Operations: Scheduling, Error Handling, Data Quality Assurance
• Extract. The raw data coming from source systems
• Clean. Data quality processing involves many discrete steps, including checking for valid values, ensuring consistency, removing duplicates, and enforcing complex business rules
• Conform. Required whenever two or more data sources are merged in the data warehouse
• Deliver. The final step is physically structuring the data into a set of dimensional models
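The four stages can be sketched as plain Python functions chained together. This is only an illustration under stated assumptions: rows are in-memory dicts rather than data staged to disk, and the field names (`id`, `state`) and the `STATE_CODES` mapping are hypothetical, not taken from the slides.

```python
# Hypothetical lookup used by the conform step to unify state spellings.
STATE_CODES = {"new york": "NY", "ny": "NY", "new jersey": "NJ", "nj": "NJ"}

def extract(source) -> list:
    """Extract: copy the raw rows exactly as the source system provides them."""
    return [dict(row) for row in source]

def clean(rows: list) -> list:
    """Clean: trim whitespace, drop rows with no id, remove duplicate ids."""
    seen = set()
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if not row.get("id") or row["id"] in seen:
            continue
        seen.add(row["id"])
        cleaned.append(row)
    return cleaned

def conform(rows: list) -> list:
    """Conform: map every source's state spelling onto one shared code."""
    return [
        {**row, "state": STATE_CODES.get(row["state"].lower(), "UNKNOWN")}
        for row in rows
    ]

def deliver(rows: list) -> dict:
    """Deliver: structure the rows for the target, here keyed by id."""
    return {row["id"]: row for row in rows}

def run_pipeline(source) -> dict:
    return deliver(conform(clean(extract(source))))

raw = [
    {"id": " 1 ", "state": "New York"},
    {"id": "1", "state": "NY"},      # duplicate id, removed by clean
    {"id": "2", "state": "nj "},
    {"id": "", "state": "NJ"},       # missing id, removed by clean
]
result = run_pipeline(raw)
```

Keeping each stage as a separate function mirrors the staging points on the slide: the output of any stage can be persisted and inspected before the next one runs.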
Data Quality Monitoring
• To trust your information, a robust set of tools for continuous monitoring is needed
• Accuracy and completeness of data must be ensured
• Any piece of information in the data ecosystem must have monitoring:
– Basic Stats: source to target counts
– Error Events: did we trap any errors during processing?
– Business Checks: is the metric “within expectations”? How does it compare with an abridged alternate calculation?
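A minimal sketch of the three kinds of monitor, one function each. The function names and the 10% tolerance are assumptions chosen for illustration; a real threshold would come from the business.

```python
def count_check(source_count: int, target_count: int) -> bool:
    """Basic stats: every source row should arrive in the target."""
    return source_count == target_count

def error_event_check(error_events: list) -> bool:
    """Error events: pass only if no errors were trapped during processing."""
    return len(error_events) == 0

def business_check(metric: float, alternate: float, tolerance: float = 0.10) -> bool:
    """Business check: the metric should be within `tolerance` (relative)
    of an abridged alternate calculation of the same quantity."""
    if alternate == 0:
        return metric == 0
    return abs(metric - alternate) / abs(alternate) <= tolerance

def monitor(source_count, target_count, error_events, metric, alternate) -> dict:
    """Run all three monitors and report each result by name."""
    return {
        "basic_stats": count_check(source_count, target_count),
        "error_events": error_event_check(error_events),
        "business_check": business_check(metric, alternate),
    }

# Two rows went missing between source and target; the metric is within 10%.
report = monitor(1000, 998, [], metric=105.0, alternate=100.0)
```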
Determine the System of Record
• Every data element has a system-of-record
• The system-of-record is the originating source of data
• Data may be copied, moved, manipulated, transformed, altered, cleansed, or made corrupt throughout the enterprise
• If you don’t use the system-of-record, data quality will be nearly impossible
• The further downstream you go from the originating data source, the greater the risk of corrupt data
• Barring rare exceptions, maintain the practice of sourcing data only from the system-of-record
Cleaning Data from Multiple Sources
[Diagram labels: Department 1 Customer List; Department 3 Customer List; Department 3 Customer List; Merge lists on multiple attributes; Remove Duplicates; Retrieve/Assign New Master Customer Key; Revised Master Customer List]
• Identify the source systems
• Understand the source systems
• Create record matching logic
• Establish survivorship rules
• Establish non-key attribute business rules
• Assign Surrogate Keys
• Load conformed dimension
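The matching, survivorship and surrogate-key steps can be sketched as follows. Everything specific here is an assumption for illustration: the match rule (normalized name plus postal code), the survivorship rule (the most recently updated non-empty value wins), and the field names.

```python
from itertools import count

def match_key(record: dict) -> tuple:
    """Record matching logic: normalized name plus postal code (hypothetical rule)."""
    name = " ".join(record["name"].lower().split())
    return (name, record["zip"].strip())

def survive(records: list) -> dict:
    """Survivorship rule: for each attribute, keep the value from the most
    recently updated record that actually has one."""
    merged = {}
    for record in sorted(records, key=lambda r: r["updated"], reverse=True):
        for field, value in record.items():
            if value and field not in merged:
                merged[field] = value
    return merged

def build_master(customer_lists: list) -> list:
    """Merge the lists, collapse duplicates, and assign surrogate keys."""
    groups = {}
    for customers in customer_lists:
        for record in customers:
            groups.setdefault(match_key(record), []).append(record)
    keys = count(1)  # surrogate keys are plain sequence numbers
    master = []
    for records in groups.values():
        row = survive(records)
        row["customer_key"] = next(keys)
        master.append(row)
    return master

list_a = [{"name": "Ann  Lee", "zip": "10001", "phone": "", "updated": "2015-01-10"}]
list_b = [
    {"name": "ann lee", "zip": "10001 ", "phone": "555-0100", "updated": "2014-06-01"},
    {"name": "Bob Ray", "zip": "07030", "phone": "555-0199", "updated": "2015-03-02"},
]
master = build_master([list_a, list_b])
```

Here the two "Ann Lee" records collapse into one master row: the name and postal code survive from the newer record, while the phone number is filled in from the older one because the newer record has none.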
Data Quality Priorities
• Be Thorough
• Be Fast
• Be Corrective
• Be Transparent
Completeness Versus Speed
[Chart: axes labeled Data Quality and Speed to Value (Slow to Fast); labels on the chart: Transparent, Corrective]
Corrective Versus Transparent
• Corrective
– Hides operational deficiencies
– ETL complex algorithms
– DW differs from OLTP
– Slows ETL Processes
• Transparent
– Highlight Issues
– Fast Delivery
– DW matches OLTP
– Forces source system cleanup
Data Quality Issues Policy
• Category A Issues must be addressed at the data source
• Category B Issues should be addressed at the data source even if there might be creative ways of deducing or recreating the derelict information
• Category C Issues, for a host of reasons, are best addressed in the data-quality ETL rather than at the source
• Category D Data-quality issues can only be pragmatically resolved in the ETL system
Data Quality Issues Bell Curve
[Chart: bell curve over the Universe of Known Data Quality Issues]
• Category A: MUST be addressed at the SOURCE
• Category B: BEST addressed at the SOURCE
• Category C: BEST addressed in ETL
• Category D: MUST be addressed in ETL
Chart annotations: Political DMZ; ETL Focus is here
Types of Data Quality Enforcement
• Column Property Enforcement
• Structure Enforcement
• Data Enforcement
• Value Enforcement
Column Property Enforcement
• Null values in required columns
• Numeric values that fall outside of range
• Columns whose lengths are unexpected
• Columns that contain data outside of allowed values
• Adherence to a required pattern
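The five column checks above can be sketched as a small rule table plus a checker. The column names, ranges, allowed values and the five-digit pattern in `RULES` are hypothetical examples, not rules from the slides.

```python
import re

# Hypothetical per-column rules; each key enables one kind of check.
RULES = {
    "customer_id": {"required": True},
    "age": {"min": 0, "max": 120},
    "state": {"length": 2, "allowed": {"NY", "NJ", "CT"}},
    "zip": {"pattern": re.compile(r"\d{5}")},
}

def check_column(name: str, value, rule: dict) -> list:
    """Return a list of problems found for one column value."""
    problems = []
    if value is None or value == "":
        if rule.get("required"):
            problems.append(f"{name}: null in required column")
        return problems
    if ("min" in rule and value < rule["min"]) or ("max" in rule and value > rule["max"]):
        problems.append(f"{name}: value outside range")
    if "length" in rule and len(value) != rule["length"]:
        problems.append(f"{name}: unexpected length")
    if "allowed" in rule and value not in rule["allowed"]:
        problems.append(f"{name}: value not in allowed set")
    if "pattern" in rule and not rule["pattern"].fullmatch(value):
        problems.append(f"{name}: does not match required pattern")
    return problems

def check_row(row: dict) -> list:
    """Apply every column rule to one row and collect all problems."""
    problems = []
    for name, rule in RULES.items():
        problems.extend(check_column(name, row.get(name), rule))
    return problems

good = {"customer_id": "C1", "age": 42, "state": "NY", "zip": "10001"}
bad = {"customer_id": "", "age": 130, "state": "NYC", "zip": "1000A"}
```

Calling `check_row(good)` returns an empty list, while `check_row(bad)` reports one problem per violated property (two for `state`, which fails both the length and the allowed-values check).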
Structure Enforcement
• Consistent Data Types
• Functional Dependencies
• Referential Integrity
• Hierarchical Relationships
• Domain Sensibility
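Two of these structure checks, referential integrity and functional dependencies, lend themselves to a short sketch. The sample tables and column names below are hypothetical; "XXZ" stands in for any currency code missing from the dimension.

```python
def orphan_keys(fact_rows: list, dimension_keys: list, key: str) -> list:
    """Referential integrity: fact foreign-key values with no dimension row."""
    return sorted({row[key] for row in fact_rows} - set(dimension_keys))

def dependency_violations(rows: list, determinant: str, dependent: str) -> dict:
    """Functional dependency: each determinant value should map to exactly
    one dependent value; return the determinants that map to several."""
    seen = {}
    for row in rows:
        seen.setdefault(row[determinant], set()).add(row[dependent])
    return {value: sorted(found) for value, found in seen.items() if len(found) > 1}

facts = [{"currency": "USD"}, {"currency": "EUR"}, {"currency": "XXZ"}]
currency_dim = ["USD", "EUR", "GBP"]

addresses = [
    {"zip": "10001", "state": "NY"},
    {"zip": "10001", "state": "NJ"},   # same zip, different state
    {"zip": "07030", "state": "NJ"},
]

orphans = orphan_keys(facts, currency_dim, "currency")
conflicts = dependency_violations(addresses, "zip", "state")
```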
Data and Value Enforcement
• Business Rules
• Missing Data Values
• Incorrect Data Values
• Embedded Meanings in Data Values
• Domain Redundancy
Data Quality Failure Options
• 1. Pass the record with no errors
• 2. Pass the record, flag offending column values
• 3. Reject the record
• 4. Stop the ETL job stream
• 5. Fix on the Fly
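One way to wire the five options into an ETL step is an enum plus a small dispatcher, sketched below. The names (`Action`, `StopJobStream`, `apply_action`, the `dq_flags` column) are hypothetical, and which option suits which violation remains a judgment call this sketch does not make.

```python
from enum import Enum

class Action(Enum):
    PASS = 1            # pass the record with no errors
    PASS_AND_FLAG = 2   # pass the record, flag offending column values
    REJECT = 3          # reject the record
    STOP = 4            # stop the ETL job stream
    FIX = 5             # fix on the fly

class StopJobStream(Exception):
    """Raised to halt the whole job stream."""

def apply_action(row: dict, action: Action, column=None, fix=None):
    """Return the row to load, or None when the record is rejected."""
    if action is Action.PASS:
        return row
    if action is Action.PASS_AND_FLAG:
        return {**row, "dq_flags": row.get("dq_flags", []) + [column]}
    if action is Action.REJECT:
        return None
    if action is Action.STOP:
        raise StopJobStream(f"halting on column {column!r}")
    if action is Action.FIX:
        return {**row, column: fix(row[column])}
    raise ValueError(f"unknown action: {action}")

row = {"city": " Boston "}
flagged = apply_action(row, Action.PASS_AND_FLAG, column="city")
fixed = apply_action(row, Action.FIX, column="city", fix=str.strip)
rejected = apply_action(row, Action.REJECT)
```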
Assessing Data Quality – It’s not as easy as it looks
Data Quality Violation | Action
1. Incoming Employee has a termination date earlier than their hire date
2. Compensation fact has currency that does not exist in the currency dimension
3. End Date is not a valid date
4. Bill Amount is 13,562,583.67 when bills usually don't exceed 1.3 million
5. The source for the region dimension contains a city 'New Yourk'
6. More than 90% of the prices are NULL while loading the Products dimension
7. The customer key is not available during the sales detail fact table load
8. Column is not found while attempting to extract the status of an employee
9. A product with existing facts has been deleted from the source system
10. The description is empty for a new product in the Product dimension
Tracking Data Quality Failures
• Error Event Star Schema
– Enables trend analysis of errors and exceptions
• Audit Dimension
– Captures specific quality context of individual fact table records
• Refer to The Data Warehouse ETL Toolkit pp. 126-129 for more information on tracking data quality errors
Error Event Table Schema
• Each error instance of each data quality check is captured
• Implemented as a sub-system of ETL
• Each fact stores the unique identifier of the defective source system record
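A simplified sketch of an error event table using Python's built-in `sqlite3`. The table and column names (`screen_dim`, `error_event_fact`, `source_record_id`, and so on) are assumptions for illustration and are not the schema shown on the slide; the point is only that one row is written per error instance and that the rows can then be aggregated for trend analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE screen_dim (
    screen_key  INTEGER PRIMARY KEY,
    screen_name TEXT,      -- the data quality check that fired
    severity    INTEGER
);
CREATE TABLE error_event_fact (
    event_key        INTEGER PRIMARY KEY,
    event_date       TEXT,
    batch_id         INTEGER,
    screen_key       INTEGER REFERENCES screen_dim(screen_key),
    source_table     TEXT,
    source_record_id TEXT  -- identifier of the defective source record
);
""")

conn.executemany(
    "INSERT INTO screen_dim VALUES (?, ?, ?)",
    [(1, "null_price", 2), (2, "bad_end_date", 3)],
)

# One row per error instance trapped by the ETL sub-system.
events = [
    ("2015-06-01", 10, 1, "products", "P-17"),
    ("2015-06-01", 10, 1, "products", "P-18"),
    ("2015-06-02", 11, 2, "employees", "E-4"),
]
conn.executemany(
    "INSERT INTO error_event_fact "
    "(event_date, batch_id, screen_key, source_table, source_record_id) "
    "VALUES (?, ?, ?, ?, ?)",
    events,
)

# Trend analysis: error counts per day and per check.
trend = conn.execute("""
    SELECT f.event_date, s.screen_name, COUNT(*)
    FROM error_event_fact AS f
    JOIN screen_dim AS s ON s.screen_key = f.screen_key
    GROUP BY f.event_date, s.screen_name
    ORDER BY f.event_date, s.screen_name
""").fetchall()
```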
Audit Dimension
• Fact table contains a foreign key to audit key
• Dummy (OK) row for records with no defects
• Audit dimensions can be unique to each fact table
• Error Event Fact can be used to fill in the measures of the audit dimension
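A minimal in-memory sketch of the idea: every fact row carries an audit key, defect-free rows share the dummy OK row, and the error count stands in for a measure derived from error events. The names (`OK_AUDIT_KEY`, `audit_dim`, `errors_by_id`) and the single `error_count` measure are hypothetical.

```python
OK_AUDIT_KEY = 0

# The audit dimension starts with only the dummy (OK) row.
audit_dim = {OK_AUDIT_KEY: {"quality": "OK", "error_count": 0}}

def audit_key_for(error_count: int) -> int:
    """Return the audit key for a fact row, adding a dimension row on first use."""
    if error_count == 0:
        return OK_AUDIT_KEY
    for key, row in audit_dim.items():
        if row["error_count"] == error_count:
            return key
    key = max(audit_dim) + 1
    audit_dim[key] = {"quality": "DEFECTIVE", "error_count": error_count}
    return key

def load_facts(rows: list, errors_by_id: dict) -> list:
    """Attach an audit foreign key to each fact row.

    `errors_by_id` stands in for counts aggregated from error events.
    """
    return [
        {**row, "audit_key": audit_key_for(errors_by_id.get(row["id"], 0))}
        for row in rows
    ]

sales = [
    {"id": "S1", "amount": 10.0},
    {"id": "S2", "amount": 25.0},
    {"id": "S3", "amount": 5.0},
]
facts = load_facts(sales, errors_by_id={"S2": 2, "S3": 2})
```

Rows with the same quality context share one audit row, so the dimension stays small even when many fact rows are defective.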
Data Quality Strategy
• 1. Perform Data Profiling
• 2. Document Data Defects
• 3. Determine Data Defect Responsibility
• 4. Define Data Quality Rules
• 5. Obtain Sign-off for Correction Logic
• 6. Integrate rules with Logical Data Mapping
The Evolution of the Enterprise Data Hub
[Diagram labels: sources (Enrollments, Claims, Finance, Others…); ETL; Traditional EDW; Big Data Lake; Horizontally Scalable Environment - Optimized for Analytics; Hadoop Distributed File System (HDFS) on nodes N1, N2, N3, N4, N5; Spark, MapReduce, Pig/Hive; NoSQL Databases]
What’s Old is New Again
• Before Data Warehousing DG/DQ
– Users trying to produce reports from raw source data
– No Data Conformance
– No Master Data Management
– No Data Quality processes
– No Trust: Two analysts were almost guaranteed to come up with two different sets of numbers!
• Before Data Lake DG/DQ
– We can put “anything” in Hadoop
– We can analyze anything
– We’re scientists, we don’t need IT, we make the rules
• Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess
• Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable
The Enterprise Data Pyramid
Tiers, top to bottom:
• Big Data Warehouse: Fully Data Governed (trusted). User community arbitrary queries and reporting.
• Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts.
• Data Lake – Integrated Sandbox: Data is ready to be turned into information: organized, well defined, complete.
• Landing Area – Source Data in “Full Fidelity”: Raw machine data collection, collect everything.
Governance annotations beside the tiers:
• Metadata → Catalog
• ILM → who has access, how long do we “manage it”
• Data Quality and Monitoring → Monitoring of completeness of data (shown on two of the three annotated tiers)
Arrows between tiers are labeled ETL, ETL/DQ, ETL/DQ, ETL/DQ.
• ETL cleans, conforms, consolidates, enriches each tier
• Only top tier of the pyramid is fully governed
Recommended Reading
• Ralph Kimball, The Data Warehouse Lifecycle Toolkit, 2nd Edition
• Jack E. Olson, Data Quality: The Accuracy Dimension
• Ralph Kimball and Joe Caserta, The Data Warehouse ETL Toolkit
Formal DW & ETL Training in NYC, 2015
Join us for one or both training courses combining two unique
workshops from international data warehousing veterans.
Workshops:
Sept 21-22 (2 days), Agile Data Warehousing with Lawrence Corr
Sept 23-24 (2 days), ETL Architecture and Design with Joe Caserta
SAVE $300 BY REGISTERING BEFORE JUNE 30TH!
Thanks! We look forward to seeing you there.
Thank You / Q&A
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta

 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteCaserta
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseCaserta
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Caserta
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Caserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for EveryoneCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the CloudCaserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by DatabricksCaserta
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkCaserta
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsCaserta
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupCaserta
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 

Plus de Caserta (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 

Dernier

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 

Dernier (20)

Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

DGIQ 2015 The Fundamentals of Data Quality

  • 1. @joe_Caserta#DGIQ2015 The Fundamentals of Data Quality Understanding, Planning and Achieving Data Quality in Your Organization Joe Caserta
  • 2. Joe Caserta Timeline (years shown: 1986, 1996, 2001, 2004, 2009, 2012, 2013, 2014, 2015): Launched Big Data practice; Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit; Data Analysis, Data Warehousing and Business Intelligence since 1996; Began consulting in database programming and data modeling; 25+ years hands-on experience building database solutions; Founded Caserta Concepts in NYC; Web log analytics solution published in Intelligent Enterprise; Launched Data Science, Data Interaction and Cloud practices; Laser focus on extending Data Analytics with Big Data solutions; Dedicated to Data Governance Techniques on Big Data (Innovation); Top 20 Big Data Consulting - CIO Review; Top 20 Most Powerful Big Data consulting firms; Launched Big Data Warehousing (BDW) Meetup NYC: 2,000+ Members; Awarded for getting data out of SAP for data analytics; Established best practices for big data ecosystem implementations
  • 3. Data Quality • The foremost reason for data warehouse failure is lack of data accuracy • Accurate data means: correct, unambiguous, consistent, complete • Every data management system needs a data quality sub-system to some degree
  • 4. The Data Quality Pipeline: Extract (extracted data staged to disk), Clean (clean data staged to disk), Conform (conformed data staged to disk), Deliver (cleansed data ready for delivery). Operations: Scheduling, Error Handling, Data Quality Assurance • Extract. The raw data coming from source systems • Clean. Data quality processing involves many discrete steps, including checking for valid values, ensuring consistency, removing duplicates, and enforcing complex business rules • Conform. Required whenever two or more data sources are merged in the data warehouse • Deliver. The final step is physically structuring the data into a set of dimensional models
  • 5. Data Quality Monitoring • To trust your information, a robust set of tools for continuous monitoring is needed • Accuracy and completeness of data must be ensured • Any piece of information in the data ecosystem must have monitoring: • Basic Stats: source-to-target counts • Error Events: did we trap any errors during processing? • Business Checks: is the metric “within expectations”? How does it compare with an abridged alternate calculation?
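The monitoring checks on slide 5 can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the `CountCheck` class, the `within_expectations` helper, the `sales_fact` table name, and the 25% tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CountCheck:
    """Result of one source-to-target row count comparison (hypothetical structure)."""
    table: str
    source_rows: int
    target_rows: int
    rejected_rows: int = 0

    @property
    def balanced(self) -> bool:
        # Every source row should either land in the target or be trapped as an error.
        return self.source_rows == self.target_rows + self.rejected_rows


def within_expectations(value: float, history: list, tolerance: float = 0.25) -> bool:
    """Business check: is the metric within +/- tolerance of its historical mean?"""
    if not history:
        return True  # nothing to compare against yet
    baseline = sum(history) / len(history)
    if baseline == 0:
        return value == 0
    return abs(value - baseline) / abs(baseline) <= tolerance


check = CountCheck(table="sales_fact", source_rows=1000, target_rows=990, rejected_rows=10)
print(check.balanced)
print(within_expectations(105.0, [100.0, 98.0, 102.0]))
print(within_expectations(180.0, [100.0, 98.0, 102.0]))
```

A count check that does not balance means rows were lost or duplicated without being trapped as error events, which is usually worth investigating before the load is trusted.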
  • 6. Determine the System of Record • Every data element has a system-of-record • The system-of-record is the originating source of data • Data may be copied, moved, manipulated, transformed, altered, cleansed, or corrupted throughout the enterprise • If you don’t source from the system-of-record, data quality will be nearly impossible • The further downstream you go from the originating data source, the greater the risk of corrupt data • Barring rare exceptions, source data only from the system-of-record
  • 7. Cleaning Data from Multiple Sources [Figure: Department 1 Customer List, Department 3 Customer List, Department 3 Customer List; Merge lists on multiple attributes; Remove Duplicates; Retrieve/Assign New Master Customer Key; Revised Master Customer List] • Identify the source systems • Understand the source systems • Create record matching logic • Establish survivorship rules • Establish non-key attribute business rules • Assign surrogate keys • Load conformed dimension
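The matching, survivorship, and surrogate key steps on slide 7 might look roughly like the sketch below. Everything specific here is an assumption for illustration: the field names (`name`, `zip`, `email`, `updated`), matching on normalized name plus ZIP code, and a "most recently updated non-empty value wins" survivorship rule. Real matching logic is usually fuzzier than an exact key comparison.

```python
from itertools import count


def match_key(record: dict) -> tuple:
    """Record matching logic: normalize the attributes used to compare customers."""
    return (record["name"].strip().lower(), record["zip"].strip())


def survive(existing: dict, incoming: dict) -> dict:
    """Survivorship rule (assumed): keep the most recently updated non-empty value."""
    newer, older = sorted([existing, incoming], key=lambda r: r["updated"], reverse=True)
    return {field: newer.get(field) or older.get(field) for field in {**older, **newer}}


def build_master(*department_lists: list) -> list:
    """Merge department lists, remove duplicates, and assign surrogate keys."""
    merged = {}
    for records in department_lists:
        for record in records:
            key = match_key(record)
            merged[key] = survive(merged[key], record) if key in merged else dict(record)
    next_key = count(1)  # surrogate keys are meaningless integers, assigned in load order
    return [{"customer_key": next(next_key), **rec} for rec in merged.values()]


dept_a = [{"name": "Ann Lee", "zip": "10001", "email": "", "updated": "2015-01-10"}]
dept_b = [
    {"name": " ann lee ", "zip": "10001", "email": "ann@example.com", "updated": "2014-06-01"},
    {"name": "Bob Ray", "zip": "07030", "email": "bob@example.com", "updated": "2015-02-01"},
]
master = build_master(dept_a, dept_b)
for row in master:
    print(row)
```

Here the two "Ann Lee" records collapse into one master row that keeps the newer name spelling and fills the missing email from the older record.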
  • 8. Data Quality Priorities • Be Thorough • Be Fast • Be Corrective • Be Transparent
  • 9. Completeness Versus Speed [Figure: Data Quality versus Speed to Value; labels: Fast, Slow, Transparent, Corrective]
  • 10. Corrective Versus Transparent • Corrective – Hides operational deficiencies – Complex ETL algorithms – DW differs from OLTP – Slows ETL processes • Transparent – Highlights issues – Fast delivery – DW matches OLTP – Forces source system cleanup
  • 11. Data Quality Issues Policy • Category A: Issues must be addressed at the data source • Category B: Issues should be addressed at the data source even if there might be creative ways of deducing or recreating the derelict information • Category C: Issues, for a host of reasons, are best addressed in the data-quality ETL rather than at the source • Category D: Data-quality issues can only be pragmatically resolved in the ETL system
  • 12. Data Quality Issues Bell Curve [Figure: bell curve over the universe of known data quality issues. Category A: MUST be addressed at the source; Category B: BEST addressed at the source; Category C: BEST addressed in ETL; Category D: MUST be addressed in ETL. Labels: Political DMZ; ETL focus is here]
  • 13. Types of Data Quality Enforcement • Column Property Enforcement • Structure Enforcement • Data Enforcement • Value Enforcement
  • 14. Column Property Enforcement • Null values in required columns • Numeric values that fall outside of range • Columns whose lengths are unexpected • Columns that contain data outside of allowed values • Adherence to a required pattern
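As a rough sketch of the five column property checks on slide 14, the Python below validates one row against a small rule table. The `RULES` dictionary, its column names, and the specific limits are hypothetical and exist only to make the example runnable.

```python
import re

# Hypothetical column rules for an illustrative employee feed.
RULES = {
    "employee_id": {"required": True, "pattern": r"E\d{5}"},
    "salary": {"required": True, "min": 0, "max": 1_000_000},
    "state": {"required": False, "allowed": {"NY", "NJ", "CT"}, "max_length": 2},
}


def column_violations(row: dict, rules: dict = RULES) -> list:
    """Return (column, problem) pairs for every column property the row violates."""
    problems = []
    for column, rule in rules.items():
        value = row.get(column)
        if value is None or value == "":
            if rule.get("required"):
                problems.append((column, "null in required column"))
            continue  # nothing further to check on a missing value
        too_low = "min" in rule and value < rule["min"]
        too_high = "max" in rule and value > rule["max"]
        if too_low or too_high:
            problems.append((column, "out of range"))
        if "max_length" in rule and len(str(value)) > rule["max_length"]:
            problems.append((column, "unexpected length"))
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append((column, "value not allowed"))
        if "pattern" in rule and not re.fullmatch(rule["pattern"], str(value)):
            problems.append((column, "pattern mismatch"))
    return problems


good = {"employee_id": "E12345", "salary": 50000, "state": "NY"}
bad = {"employee_id": "12345", "salary": -5, "state": "New York"}
print(column_violations(good))
print(column_violations(bad))
```

Keeping the rules in data rather than in code makes it easier to review them with the business owners who sign off on the checks.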
  • 15. Structure Enforcement • Consistent Data Types • Functional Dependencies • Referential Integrity • Hierarchical Relationships • Domain Sensibility
  • 16. Data and Value Enforcement • Business Rules • Missing Data Values • Incorrect Data Values • Embedded Meanings in Data Values • Domain Redundancy
  • 17. Data Quality Failure Options • 1. Pass the record with no errors • 2. Pass the record, flag offending column values • 3. Reject the record • 4. Stop the ETL job stream • 5. Fix on the Fly
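One way to make the five failure options on slide 17 concrete is to model them as an enumeration and dispatch on it. The names `FailureAction`, `StopJobStream`, `apply_action`, and the `dq_flags` field are assumptions for this sketch and do not come from the slides.

```python
from enum import Enum


class FailureAction(Enum):
    PASS = 1            # pass the record with no errors
    PASS_AND_FLAG = 2   # pass the record, flag offending column values
    REJECT = 3          # reject the record
    STOP = 4            # stop the ETL job stream
    FIX = 5             # fix on the fly


class StopJobStream(Exception):
    """Raised when a violation is severe enough to halt the load."""


def apply_action(record, action, column=None, fix=None):
    """Return the record to load, or None when the record is rejected."""
    if action is FailureAction.PASS:
        return record
    if action is FailureAction.PASS_AND_FLAG:
        return {**record, "dq_flags": record.get("dq_flags", []) + [column]}
    if action is FailureAction.REJECT:
        return None
    if action is FailureAction.STOP:
        raise StopJobStream(f"fatal data quality violation in column {column!r}")
    # FailureAction.FIX: apply the supplied correction to the offending column.
    return {**record, column: fix(record[column])}


row = {"city": "New Yourk"}
flagged = apply_action(row, FailureAction.PASS_AND_FLAG, "city")
fixed = apply_action(row, FailureAction.FIX, "city", fix=lambda value: "New York")
print(flagged)
print(fixed)
```

Which action applies to which violation is a policy decision; the code only guarantees that each choice is handled explicitly and consistently.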
  • 18. Assessing Data Quality – It’s not as easy as it looks. Table with columns Data Quality Violation and Action; the violations listed are:
      1. Incoming Employee has a termination date earlier than their hire date
      2. Compensation fact has currency that does not exist in the currency dimension
      3. End Date is not a valid date
      4. Bill Amount is 13,562,583.67 when bills usually don't exceed 1.3 million
      5. The source for the region dimension contains a city 'New Yourk'
      6. More than 90% of the prices are NULL while loading the Products dimension
      7. The customer key is not available during the sales detail fact table load
      8. Column is not found while attempting to extract the status of an employee
      9. A product with existing facts has been deleted from the source system
      10. The description is empty for a new product in the Product dimension
  • 19. Tracking Data Quality Failures • Error Event Star Schema – Enables trend analysis of errors and exceptions • Audit Dimension – Captures specific quality context of individual fact table records • Refer to The Data Warehouse ETL Toolkit, pp. 126-129, for more information on tracking data quality errors
  • 20. Error Event Table Schema • Each error instance of each data quality check is captured • Implemented as a sub-system of ETL • Each fact stores the unique identifier of the defective source system record
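The error event schema described on slide 20 could be sketched with Python's built-in `sqlite3` module as below. The table and column names (`check_dim`, `error_event_fact`, `source_record_id`, and so on) are assumptions chosen for illustration; the slide itself only states that each error instance is captured along with the identifier of the defective source record.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE check_dim (                 -- one row per data quality check (assumed name)
    check_key   INTEGER PRIMARY KEY,
    check_name  TEXT NOT NULL,
    severity    TEXT NOT NULL
);
CREATE TABLE error_event_fact (          -- one row per error instance
    error_event_key  INTEGER PRIMARY KEY,
    event_date       TEXT NOT NULL,
    batch_id         INTEGER NOT NULL,
    check_key        INTEGER NOT NULL REFERENCES check_dim (check_key),
    source_system    TEXT NOT NULL,
    source_record_id TEXT NOT NULL       -- identifier of the defective source record
);
""")

conn.executemany(
    "INSERT INTO check_dim VALUES (?, ?, ?)",
    [(1, "termination_before_hire", "reject"), (2, "unknown_currency", "flag")],
)
conn.executemany(
    "INSERT INTO error_event_fact "
    "(event_date, batch_id, check_key, source_system, source_record_id) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("2015-06-01", 101, 1, "HR", "emp-17"),
        ("2015-06-01", 101, 2, "PAYROLL", "pay-903"),
        ("2015-06-02", 102, 2, "PAYROLL", "pay-911"),
    ],
)

# Trend analysis: how often does each check fail?
trend = conn.execute(
    "SELECT c.check_name, COUNT(*) FROM error_event_fact f "
    "JOIN check_dim c ON c.check_key = f.check_key "
    "GROUP BY c.check_name ORDER BY c.check_name"
).fetchall()
print(trend)
```

Because every failure is a fact row, the same table supports both per-record lookups and aggregate trend reports by check, source system, or batch.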
  • 21. Audit Dimension • Fact table contains a foreign key to audit key • Dummy (OK) row for records with no defects • Audit dimensions can be unique to each fact table • Error Event Fact can be used to fill in the measures of the audit dimension
  • 22. Data Quality Strategy • 1. Perform Data Profiling • 2. Document Data Defects • 3. Determine Data Defect Responsibility • 4. Define Data Quality Rules • 5. Obtain Sign-off for Correction Logic • 6. Integrate rules with Logical Data Mapping
  • 23. The Evolution of the Enterprise Data Hub [Figure: sources (Enrollments, Claims, Finance, Others…); ETL; Traditional EDW; Big Data Lake: Horizontally Scalable Environment - Optimized for Analytics, with NoSQL Databases, ETL, Spark, MapReduce, Pig/Hive over Hadoop Distributed File System (HDFS) nodes N1, N2, N3, N4, N5]
  • 24. What’s Old is New Again • Before Data Warehousing DG/DQ – Users trying to produce reports from raw source data – No Data Conformance – No Master Data Management – No Data Quality processes – No Trust: Two analysts were almost guaranteed to come up with two different sets of numbers! • Before Data Lake DG/DQ – We can put “anything” in Hadoop – We can analyze anything – We’re scientists, we don’t need IT, we make the rules • Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess • Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable
  • 25. The Enterprise Data Pyramid [Figure: pyramid tiers: Big Data Warehouse; Data Science Workspace; Data Lake – Integrated Sandbox; Landing Area – Source Data in “Full Fidelity”, with ETL/DQ labels between tiers. Annotations: Raw machine data collection, collect everything; Data is ready to be turned into information: organized, well defined, complete; Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts; Fully Data Governed (trusted); User community arbitrary queries and reporting. Governance labels: Metadata → Catalog; ILM → who has access, how long do we “manage it”; Data Quality and Monitoring → Monitoring of completeness of data] • ETL cleans, conforms, consolidates, enriches each tier • Only the top tier of the pyramid is fully governed
  • 26. Recommended Reading • Ralph Kimball, The Data Warehouse Lifecycle Toolkit, 2nd Edition • Jack E. Olson, Data Quality: The Accuracy Dimension • Ralph Kimball and Joe Caserta, The Data Warehouse ETL Toolkit
  • 27. Formal DW & ETL Training in NYC, 2015. Join us for one or both training courses combining two unique workshops from international data warehousing veterans. Workshops: Sept 21-22 (2 days), Agile Data Warehousing with Lawrence Corr; Sept 23-24 (2 days), ETL Architecture and Design with Joe Caserta. Save $300 by registering before June 30th! Thanks! We look forward to seeing you there.
  • 28. Thank You / Q&A. Joe Caserta, President, Caserta Concepts, joe@casertaconcepts.com, (914) 261-3648, @joe_Caserta