Data protection and privacy regulations such as the EU’s General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and Singapore’s Personal Data Protection Act (PDPA) have been major drivers for data governance initiatives and the emergence of data catalog solutions. Organizations have an ever-increasing appetite to leverage their data for business advantage, whether through internal collaboration, data sharing across ecosystems, direct commercialization, or as the basis for AI-driven business decision-making. This requires data governance, and especially data catalog solutions, to step up once again and enable data-driven businesses to leverage their data responsibly, ethically, compliantly, and accountably.
This presentation explores how the data catalog has become a key technology enabler in overcoming these challenges.
1. IDMA 2021 Fall/Winter Conference
October 13th-14th, 2021
Data Catalog as a Business Enabler
Presented by Srinivasan Sankar
2. Disclaimer
Please note that the views expressed by our speakers are
their own and may not necessarily reflect those of their
respective employers.
This material is for general informational purposes only and
is not legal advice. It is not designed to be comprehensive,
and it may not apply to your particular facts and
circumstances.
3. TOPICS
• Improve insights by extracting value from unstructured data utilizing a machine
learning augmented data catalog
• Practical steps to deal with the onslaught of data and learn how to implement an
effective data catalog
• Overcoming data silos using intelligent tools
• Let the insights come to you with AI-augmentation
• Multi-source data to increase the potential of data value
• Data Catalog – key enabler of a Data Mesh
4.
5. NEW DATA, NEW INSIGHTS:
MAXIMIZING THE VALUE OF
YOUR STRUCTURED AND
UNSTRUCTURED DATA
6.
7. Definition
A data catalog creates and maintains an inventory of data assets through the
discovery, description and organization of distributed datasets. The data catalog
provides context to enable data stewards, data/business analysts, data engineers, data
scientists and other line of business (LOB) data consumers to find and understand
relevant datasets for the purpose of extracting business value.
In a nutshell, a data catalog is a place that shows what data assets you have and where they are
located. You might be asking, what is a data asset? That is any entity (e.g., reports, databases,
websites) that contains data.
Data Catalogs Are the New Black in Data Management and Analytics
8.
9. • Leverage an ML-augmented data catalog as a first step in metadata management
• Deploy data catalogs with the capability to scale beyond narrow (or tactical) use-case
requirements (such as cataloging data only within a Hadoop distribution)
10. AI POWERED PROCESS FOR CURATING,
VERIFYING, AND CLASSIFYING DATA THAT
ENHANCES SPEED AND USABILITY
What is it?
Use algorithms (advanced statistics and deep learning) to learn from large-scale data to identify
patterns, quality issues and anomalies, and business rules across massive, complex and
often streaming data sets
Applicable to large, complex and often streaming data sets: 3rd party data, sensor data,
customer data, transactions
How does it work?
• Algorithmic sampling of data to identify key patterns and business rules
• Continuous monitoring to alert Data Stewards of exceptions for timely resolution
• Correlation of data concepts across domains and data sources to track usage and establish
lineage
• Ability to ingest and apply quality rules to third party and unstructured data sources
• Establishes feedback loop that refines the machine learning models to improve data quality
over time
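The first two steps above (sampling to find a pattern, then flagging exceptions for Data Stewards) can be illustrated with a minimal sketch. It uses a simple frequency heuristic as a stand-in for the statistical or deep learning models the slide refers to. The function names (`shape`, `infer_rule`, `find_exceptions`) and the sample policy IDs are hypothetical and do not come from any particular catalog product.

```python
import random
import re
from collections import Counter


def shape(value: str) -> str:
    """Reduce a value to a coarse pattern: every digit becomes 9, every letter becomes A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def infer_rule(values, sample_size=100, seed=0):
    """Sample the column and return its most common shape and that shape's share of the sample."""
    rng = random.Random(seed)
    sample = rng.sample(values, min(sample_size, len(values)))
    counts = Counter(shape(v) for v in sample)
    pattern, hits = counts.most_common(1)[0]
    return pattern, hits / len(sample)


def find_exceptions(values, pattern):
    """Return the values that do not match the inferred shape, for steward review."""
    return [v for v in values if shape(v) != pattern]


# Hypothetical column of policy identifiers with two malformed entries.
policy_ids = ["PL-1001", "PL-1002", "PL-2040", "1003-PL", "PL-3317", "unknown"]
pattern, support = infer_rule(policy_ids)
exceptions = find_exceptions(policy_ids, pattern)
print(pattern, round(support, 2), exceptions)
```

In a real deployment the inferred rule would be confirmed by a Data Steward before being enforced, which is one way the feedback loop mentioned above could be closed.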
11. THE CASE FOR DATA CATALOGS
Analyze data, don't chase data – Many data scientists spend over two-thirds of their time finding and understanding data. The main cause is a poor mechanism for handling and tracking data across the organization. A good Catalog helps the Data Scientist or Business Analyst understand the data and answer the questions they have.
Efficient Access Control – As an organization grows, role-based policies are needed, because not everybody should be able to modify the data. Access Control should be implemented while building the Data Lake. Roles are assigned to users, and Data Access is controlled according to those roles.
Eliminate Data Redundancies – A good Catalog tool helps find data redundancies and eliminate them. This saves storage and data management costs.
Follow Laws – Different protection laws and standards apply depending on the data, such as GDPR, BASEL, GDSN, HIPAA, and many more. These must be followed when dealing with the data they cover. But they address different use cases and don't apply to every data set; to know which ones apply, we need to understand the data set. A good Catalog helps ensure Data Compliance by giving a view of Data Lineage and by using Access Control.
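The role-based access control described above can be sketched as follows. This is only an illustration under assumed names: the roles, the permissions, the `Dataset` class and the `can_access` function are all hypothetical, and a real Data Lake would delegate these checks to its platform's access control service.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission mapping; real roles come from the organization's policy.
ROLE_PERMISSIONS = {
    "data_steward": {"read", "write"},
    "analyst": {"read"},
}


@dataclass
class Dataset:
    name: str
    allowed_roles: set = field(default_factory=set)  # roles granted access to this data set


def can_access(role: str, action: str, dataset: Dataset) -> bool:
    """Allow an action only if the role is granted on the data set and the role permits the action."""
    if role not in dataset.allowed_roles:
        return False
    return action in ROLE_PERMISSIONS.get(role, set())


claims = Dataset("claims", allowed_roles={"data_steward"})
weather = Dataset("weather", allowed_roles={"data_steward", "analyst"})

print(can_access("analyst", "read", weather))      # analysts may read weather data
print(can_access("analyst", "write", weather))     # but may not modify it
print(can_access("analyst", "read", claims))       # and have no access to claims
print(can_access("data_steward", "write", claims))
```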
12. HOW TO ADOPT DATA CATALOGS
Data Cataloging is a journey...
Phased approach
Phase 1: Catalog and Lineage
• Infrastructure and installation of Catalog tool
• Data Architects to initiate the collection of data assets, catalog and identify lineage
Phase 2: Data Stewardship, Business Glossary
• Appoint part-time Governance Lead role (cross-functional, business facing)
• Supporting Analyst
• Manage Governance activities
Phase 3: Operationalize Governance activities
• Accountability, Ownership of Data
• Operationalize Data Governance activities
• Report Metrics
• Iterate activities for all information / data projects
Improve / Enhance Data Governance
[Diagram labels: Establish Data Governance; Sustain Data Governance; Manage Data Lifecycle; Plan; Organize; Define; Deploy; Communicate; Manage Return On Investment; Maintain Organization & Sponsorship; Review/Update Processes; Review/Update Scope (Quarterly Workshop); Business Change Management; Review & Approve New Projects; Maintain Data Definitions; Maintain Metrics; Identify Data Stewards; Conflict Resolution, Escalation; Core Foundation; Augmented Data Catalog*]
* Machine learning powered process for curating, verifying, and classifying data that enhances speed and usability
13. DATA CATALOG BEST PRACTICES
Assigning Ownership for the data set – Ownership of each data set must be defined. There must be a person whom users can contact in case of an issue. A good Catalog must also identify the owner of each data set.
Human Touch – After the Catalog is built, users must verify the data sets to make them more accurate.
Searchability – The Catalog should support search. Searchability enables Data Asset Discovery, so data consumers can easily find assets that meet their needs.
Data Protection – Define Access policies to prevent unauthorized data access.
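The ownership and searchability practices can be illustrated together with a small sketch: each catalog entry records an owner, and a keyword search ranks entries by how often the query terms appear. The `CatalogEntry` class, the `search` function, the sample entries and the contact addresses are hypothetical, and the term-count ranking is a deliberately naive stand-in for a real search index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    name: str
    description: str
    owner: str          # the contact for questions or issues about this data set
    tags: tuple = ()


def search(entries, query):
    """Return the entries that mention any query term, most frequent mentions first."""
    terms = query.lower().split()
    scored = []
    for entry in entries:
        text = " ".join([entry.name, entry.description, *entry.tags]).lower()
        score = sum(text.count(term) for term in terms)
        if score:
            scored.append((score, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


# Hypothetical entries for illustration only.
entries = [
    CatalogEntry("fleet_gps", "GPS positions of fleet vehicles",
                 "ops-data@example.com", ("telematics",)),
    CatalogEntry("claims_2021", "Insurance claims filed in 2021",
                 "claims-data@example.com", ("claims", "finance")),
]

for hit in search(entries, "claims"):
    print(hit.name, "owned by", hit.owner)
```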
14. HIGH ROI FOR MULTI-SOURCE DATA WITH DATA CATALOG
Single source data has value in relation to other data in the organization, and the ability
to search and analyze across multiple information sources provides tremendous insight
[Graphic labels: Traditional DW; Enterprise Data Integration and Data Lake; Weather, Highway safety, Industry; Driving Tracker, Nest Protect, GPS Fleet Tracking; Data Catalog]
Graphic source: CEB analysis
15. DATA CATALOG
THE NUCLEUS OF A DATA MESH*
• A data product must be easily discoverable, typically
through a data catalog that holds its meta
information such as owner, source of origin,
lineage, sample datasets, etc. This centralized
discoverability service allows data consumers,
engineers and scientists in an organization to find a
dataset of interest easily. Each domain data
product must register itself with this centralized
data catalog for easy discoverability.
• Note the perspective shift here is from a single
platform extracting and owning the data for its use,
to each domain providing its data as a product in a
discoverable fashion.
• Data catalog platforms provide central
discoverability, access control and governance of
distributed domain datasets.
*Data Mesh (a concept introduced by Zhamak Dehghani) is a sociotechnical approach to sharing, accessing and managing analytical data in complex and large-scale environments, within or across organizations
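The registration step described on this slide, where each domain publishes its data product to a central catalog, might look roughly like the sketch below. All names here (`DataProduct`, `CentralCatalog`, the `domain.name` key format and the sample products) are hypothetical and are not taken from any Data Mesh platform.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    name: str
    domain: str
    owner: str
    source: str
    upstream: list = field(default_factory=list)  # lineage: keys of products this one derives from


class CentralCatalog:
    """A minimal in-memory registry that domains publish their data products to."""

    def __init__(self):
        self._products = {}

    def register(self, product: DataProduct) -> str:
        key = f"{product.domain}.{product.name}"
        if key in self._products:
            raise ValueError(f"{key} is already registered")
        self._products[key] = product
        return key

    def find_by_domain(self, domain: str) -> list:
        return sorted(k for k, p in self._products.items() if p.domain == domain)

    def lineage(self, key: str) -> list:
        """Collect all upstream product keys, following references transitively."""
        seen = []
        stack = list(self._products[key].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            if current in self._products:
                stack.extend(self._products[current].upstream)
        return seen


catalog = CentralCatalog()
catalog.register(DataProduct("trips", "fleet", "fleet-team", "GPS trackers"))
catalog.register(DataProduct("risk_scores", "underwriting", "uw-team", "risk model",
                             upstream=["fleet.trips"]))
print(catalog.find_by_domain("fleet"))
print(catalog.lineage("underwriting.risk_scores"))
```

The domains keep ownership of their data in this sketch; the catalog only stores the metadata needed for discovery and lineage.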