RWDG Slides: How to Govern Data Lakes
Copyright © 2019 Robert S. Seiner – KIK Consulting & Educational Services / TDAN.com
Non-Invasive Data Governance™ is a trademark of Robert S. Seiner & KIK Consulting
#RWDG @RSeiner
How to Govern Data Lakes
with Special Guest Evan Terry
Monthly Webinar Series Hosted by DATAVERSITY
Robert S. Seiner – KIK Consulting / TDAN.com
July 18, 2019 – 11:00 a.m. PT / 2:00 p.m. ET
Real-World Data Governance
3. 4 big trends driving the need for a new architecture
• Separation of compute & storage
• Hybrid and multi-cloud environments
• Self-service data across the enterprise
• Rise of the object store
6. Alluxio’s Approach to Big Data Federation
• Unified Access - Acts as a “virtual data lake.” Files are accessed in Alluxio’s global namespace as if they resided in a single system.
• Performant - Provides fast local access to important and frequently used data, without maintaining a permanent copy of all data.
• Modern, flexible architecture - Promotes separation of compute from storage.
• Storage Cost Optimization - Transparently reads and writes data directly from the source system, so it does not need to create a permanent copy of the data.
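The “unified access” idea on this slide can be illustrated with a toy mount table that maps one logical path space onto several storage systems. This is only a conceptual sketch: the `VirtualNamespace` class, its methods, and the bucket and host names are hypothetical and are not Alluxio’s API.

```python
from typing import Dict


class VirtualNamespace:
    """Toy mount table: one logical path space over several storage systems.

    Hypothetical illustration only; this is not Alluxio's API.
    """

    def __init__(self) -> None:
        self._mounts: Dict[str, str] = {}

    def mount(self, prefix: str, backend_uri: str) -> None:
        """Attach a storage location under a logical path prefix."""
        self._mounts[prefix.rstrip("/")] = backend_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        """Translate a logical path into the URI of the system that holds it."""
        # The longest matching prefix wins, so nested mounts behave predictably.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return self._mounts[prefix] + path[len(prefix):]
        raise KeyError(f"no mount covers {path}")


ns = VirtualNamespace()
ns.mount("/sales", "s3://example-bucket/sales")        # hypothetical bucket
ns.mount("/logs", "hdfs://example-namenode:8020/logs")  # hypothetical cluster
print(ns.resolve("/sales/2019/q2.parquet"))
```

Callers only ever see paths such as `/sales/...`; where the bytes actually live is a detail of the mount table.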
7. Key Innovations of the Data Orchestration Layer
• Data Elasticity with a unified namespace: Abstract data silos & storage systems to independently scale data with compute
• Data Accessibility for popular APIs & API translation: Run Spark, Hive, Presto, ML workloads on your data located anywhere
• Data Locality with Intelligent Multi-tiering: Accelerate big data workloads with transparent tiered local data
8. Use Cases Data Orchestration Enables
• Run big data workloads in hybrid cloud environments
• Accelerate big data frameworks on the public cloud
• Enable big data on object stores across single or multiple clouds
[Diagram: Hive, Spark, and Presto running on Alluxio; deployment labels: on premise, any cloud / multi cloud, same instance / container, same data center / region, standalone]
9. Incredible Open Source Momentum with growing community
• 900+ contributors & growing
• 3760+ GitHub stars
• Apache 2.0 Licensed
• Hundreds of thousands of downloads
• Join the conversation on Slack: alluxio.org/slack
10. How to Govern Data Lakes: Introduction
• Real-World Data Governance – Monthly Webinar Series
– August 15, 2019 – Data Governance versus Information Governance
– Third Thursday each month @ 2 p.m. ET – Register at TDAN.com, KIKconsulting.com, DATAVERSITY.net
• Non-Invasive Data Governance Book
– ISBN 9781935504856 / Technics Publishing / Amazon.com
• Speaking @ Dataversity Events
– Data Architecture Summit, Chicago – October 14-17
– Data Governance Vision, Washington, DC – December 9-12
• Non-Invasive Data Governance Online Learning Plan / Non-Invasive Metadata Governance Online Learning Plan
– DATAVERSITY Training Center
– https://training.dataversity.net
• The Data Administration Newsletter (TDAN.com)
– Twice Monthly – Data Articles, Columns, Blogs and Features
– Produced by DATAVERSITY – Subscribe for emails
– New Non-Invasive Data Governance Framework now being published
• KIK Consulting & Educational Services – KIKConsulting.com
– Home of Non-Invasive Data Governance™
– Home of Non-Invasive Metadata Governance
11. How to Govern Data Lakes: Special Guest Evan Terry
Chief Analytics Officer, Velocity Mortgage Capital
Evan brings over 20 years of consulting experience in IT environments, including leading software development projects, designing and implementing IT and data strategies, and working on long-term, cross-departmental projects in such diverse industries as automotive, retail, state government, and e-commerce payments.
Evan’s areas of expertise include designing practical analytics solutions, aligning business and IT strategies, and implementing data management and governance programs.
He co-authored the data modeling book Beginning Relational Data Modeling and has spoken about data and process quality and systems design. Evan has a BA in Economics from McGill University and an MBA from Columbia Business School.
12. How to Govern Data Lakes: Abstract
• In this webinar, Bob and Evan will discuss:
– The relationship between Data Lakes and Data Governance
– Preventing your Data Lake from becoming a Data Swamp
– Governing the Metadata associated with your Data Lake
– Leveraging governed data to provide trustworthy Analytics
– Measuring the value of a governed Data Lake
13. How to Govern Data Lakes: The Relationship between Data Lakes and Data Governance
• What is Data Governance?
– The execution and enforcement of authority over the definition, production and usage of data and data-related assets. (Robert S. Seiner)
– The management and organization of data. (Evan Terry)
– The orchestration of people and process and data.
– The harmonization of people and process and data.
– The formalization of accountability for data.
– The implementation of decision-rights for data.
14. How to Govern Data Lakes: The Relationship between Data Lakes and Data Governance
• What is a Data Lake?
– A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files.
– A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. (SAS Article, 2016)
• When does a Data Lake become a Data Swamp?
– A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value. (Olavsrud, Thor. CIO 2017)
– When the data in the lake is ungoverned.
15. How to Govern Data Lakes: The Relationship between Data Lakes and Data Governance
• Governance (how to manage and organize) connects to data lakes by keeping data management accurate and useful
• Catalogs are critical to help you govern data, especially in data lakes
– Find things
– Define things
– Curate content
• Need to include policy-driven processes that classify and identify the information in the lake, why it’s in there, what it means, who owns it, and who is using it
• A data lake without data governance will ultimately end up being a collection of disconnected data pools or information silos, just all in one place.
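The catalog questions on this slide (what the data means, why it is in the lake, who owns it, who is using it) can be pictured as fields on a catalog entry. The sketch below is a minimal illustration; the `CatalogEntry` class, its field names, and the sample data sets are assumptions, not part of any particular catalog product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CatalogEntry:
    """One data set in the lake (hypothetical structure for illustration)."""
    name: str
    definition: str = ""                   # what it means
    purpose: str = ""                      # why it is in the lake
    owner: str = ""                        # who owns it
    consumers: List[str] = field(default_factory=list)  # who is using it
    classification: str = "unclassified"   # e.g. "public", "confidential"

    def missing_fields(self) -> List[str]:
        """List the governance questions this entry cannot answer yet."""
        gaps = []
        if not self.definition:
            gaps.append("definition")
        if not self.purpose:
            gaps.append("purpose")
        if not self.owner:
            gaps.append("owner")
        if self.classification == "unclassified":
            gaps.append("classification")
        return gaps


def find(catalog: List[CatalogEntry], term: str) -> List[CatalogEntry]:
    """Case-insensitive search over names and definitions ("find things")."""
    needle = term.lower()
    return [
        entry for entry in catalog
        if needle in entry.name.lower() or needle in entry.definition.lower()
    ]


catalog = [
    CatalogEntry(
        name="loan_applications",
        definition="Submitted loan applications",
        purpose="credit analytics",
        owner="lending-ops",
        consumers=["analytics"],
        classification="confidential",
    ),
    CatalogEntry(name="web_clicks_raw"),  # landed in the lake, not yet curated
]

for entry in catalog:
    print(entry.name, "needs:", entry.missing_fields())
```

Entries with an empty `missing_fields()` list are curated; the rest form a to-do list for whoever curates the lake.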
16. How to Govern Data Lakes: Preventing your Data Lake from becoming a Data Swamp
• What can be done to prevent the swamping of your data lake?
– Implement data governance for the lake.
– Implement metadata management for the lake.
– Implement sound principles of Data Definition, Data Production, and Data Usage.
• What is the appropriate level of data governance for your data lake?
17. How to Govern Data Lakes: Preventing your Data Lake from becoming a Data Swamp
• A “data lake” becomes a data swamp without organization
– No organization, no curation of content, little metadata
• Data warehouse principles are relevant:
– Stewardship/Curation
– Design, documentation, maintenance of the lake
– Metadata capture
– Governance
• Technique - Create zones in your data lake:
– Transition data sets from “raw data” to “clean data”
– Apply different curation/governance principles to each zone
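The zone technique can be sketched as a small promotion gate: each zone demands a different amount of metadata, and a data set moves from “raw” to “clean” only when the stricter rules are met. The zone names come from the slide; the required metadata keys and the `promote` function are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed per-zone rules: the raw zone is permissive, the clean zone is stricter.
ZONE_REQUIREMENTS: Dict[str, List[str]] = {
    "raw": ["source"],
    "clean": ["source", "owner", "definition", "quality_checked"],
}
ZONE_ORDER = ["raw", "clean"]


@dataclass
class DataSet:
    name: str
    zone: str
    metadata: Dict[str, object]


def unmet_requirements(ds: DataSet, target_zone: str) -> List[str]:
    """Return the metadata keys the target zone requires but the data set lacks."""
    return [key for key in ZONE_REQUIREMENTS[target_zone] if not ds.metadata.get(key)]


def promote(ds: DataSet) -> DataSet:
    """Move a data set to the next zone, or raise if that zone's rules are unmet."""
    position = ZONE_ORDER.index(ds.zone)
    if position == len(ZONE_ORDER) - 1:
        raise ValueError(f"{ds.name} is already in the last zone")
    target = ZONE_ORDER[position + 1]
    missing = unmet_requirements(ds, target)
    if missing:
        raise ValueError(f"{ds.name} cannot enter '{target}': missing {missing}")
    return DataSet(ds.name, target, dict(ds.metadata))


raw = DataSet("web_clicks", "raw", {"source": "cdn-logs"})
print(unmet_requirements(raw, "clean"))
```

Keeping the rules in one table makes it easy to tune how strict each zone is without touching the promotion logic.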
18. How to Govern Data Lakes: Governing the Metadata Associated with your Data Lake
• Governing metadata associated with:
– Data Definition
– Data Production
– Data Usage
• (Where) Is there metadata associated with your data lake?
• Who is responsible for the metadata associated with your data lake?
• “The metadata will not govern itself!”
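One way to act on “who is responsible for the metadata?” is a simple audit that lists, per data set, the metadata areas with no named steward. The three areas come from the slide; the input layout, the function name, and the sample steward names are assumptions made for this sketch.

```python
from typing import Dict, List, Optional

# The three kinds of metadata named on the slide.
METADATA_AREAS = ("definition", "production", "usage")


def unassigned_metadata(
    stewards: Dict[str, Dict[str, Optional[str]]]
) -> Dict[str, List[str]]:
    """Map each data set to the metadata areas that have no named steward."""
    gaps: Dict[str, List[str]] = {}
    for dataset, by_area in stewards.items():
        missing = [area for area in METADATA_AREAS if not by_area.get(area)]
        if missing:
            gaps[dataset] = missing
    return gaps


# Hypothetical assignments: data set -> metadata area -> responsible person or team.
stewards = {
    "loan_applications": {
        "definition": "lending-ops",
        "production": "data-engineering",
        "usage": "analytics",
    },
    "web_clicks": {"definition": "marketing", "usage": None},
}

print(unassigned_metadata(stewards))
```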
19. How to Govern Data Lakes: Governing the Metadata Associated with your Data Lake
• Cataloging is key, but is tricky:
– Don’t under- or over-catalog
– Don’t be too loose or too rigid in your governance rules
• “Goldilocks” mentality – everything in moderation
• Tune governance to priorities and context
– One person’s data lake is another’s data swamp
– Don’t turn the data lake into a data warehouse (the clearest data lake)
– Cannot be all things to all people – playground, incubator, or operational data store?
20. How to Govern Data Lakes: Leveraging Governed Data to Provide Trustworthy Analytics
• Sample DG purpose statement – Use strategic data with confidence.
• Make certain the water is clean or it may be unhealthy.
• “Boil water alert” – Is data governance the boiling of the water?
• “Freshwater” versus “Saltwater” determines species that will live in your lake.
21. How to Govern Data Lakes: Leveraging Governed Data to Provide Trustworthy Analytics
• Data catalogs solve the problems of finding, interpreting and using data
• A data lake is a tool, and context is key – required data quality differs by use
• “Trustworthy” depends on context and accuracy needs – data lakes are by definition “less” controlled and structured
22. How to Govern Data Lakes: Leveraging Governed Data to Provide Trustworthy Analytics
• Provides much the same value as for a data warehouse – analytics requires:
– Knowing who owns the data and can answer questions about it
– Finding the right data elements that meet your needs
– Cleaning the data to an appropriate level of quality
– Having the right security on the data being used
– Monitoring the data for adherence to standards
• Lightweight governance over adding, naming, and organizing data protects the shared resource from the “tragedy of the commons”
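Lightweight governance over adding and naming data could be as small as an admission check run before a data set lands in the shared lake. The sketch below assumes a `<zone>.<domain>.<dataset>` naming convention and an owner requirement; both rules, and the function name, are hypothetical examples rather than a recommended standard.

```python
import re
from typing import List, Set

# Assumed convention: <zone>.<domain>.<dataset>, lowercase snake_case parts.
NAME_PATTERN = re.compile(r"^(raw|clean)\.[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$")


def admission_problems(name: str, owner: str, existing: Set[str]) -> List[str]:
    """Return the reasons a new data set should not be added yet (empty if none)."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append("name does not follow <zone>.<domain>.<dataset>")
    if name in existing:
        problems.append("name already in use")
    if not owner.strip():
        problems.append("no owner named")
    return problems


existing = {"raw.sales.orders"}
print(admission_problems("raw.sales.returns", "sales-ops", existing))
print(admission_problems("Temp Copy 2", "", existing))
```

Three cheap checks like these keep the commons tidy without turning every upload into a review board.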
23. How to Govern Data Lakes: Measuring the Value of a Governed Data Lake
• Metrics are one of the 6 core components of Data Governance: data, people, process, communications, metrics and tools.
• Measuring people’s ____________ the data in the lake.
– confidence in
– understanding of
– usage of
– decisions made using
– knowledge of what data resides in
– … all will depend on the effective management of metadata associated with your data lake.
24. How to Govern Data Lakes: Measuring the Value of a Governed Data Lake
• Considerations for providing metrics
– Benchmark current status
– Select metrics that mean something to someone
– Select metrics associated with the data lake rather than data governance
– Keep in mind that Return on Investment (ROI) for DG is not easy to measure
– Go jump in the lake!
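Benchmarking current status can start with a few ratios computed from the lake’s own metadata and then recomputed later for comparison. The three metrics below (share of data sets that are documented, owned, and recently used) and the field names are assumptions chosen for illustration; real metrics should be ones that mean something to your stakeholders.

```python
from typing import Dict, List


def lake_metrics(datasets: List[Dict[str, object]]) -> Dict[str, float]:
    """Share of data sets that are documented, owned, and recently used."""
    total = len(datasets)

    def share(key: str) -> float:
        if total == 0:
            return 0.0
        return sum(1 for d in datasets if d.get(key)) / total

    return {
        "documented": share("definition"),
        "owned": share("owner"),
        "used": share("queries_last_30d"),
    }


def change_since(baseline: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Difference between the current metrics and the benchmark."""
    return {name: current[name] - baseline[name] for name in baseline}


# Hypothetical snapshot taken before governance work starts.
before = [
    {"definition": "", "owner": "sales", "queries_last_30d": 12},
    {"definition": "", "owner": "", "queries_last_30d": 0},
    {"definition": "Customer orders", "owner": "ops", "queries_last_30d": 0},
    {"definition": "", "owner": "", "queries_last_30d": 0},
]
baseline = lake_metrics(before)
print(baseline)
```

These ratios describe the lake itself rather than the governance program, in line with the advice to pick metrics associated with the data lake.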
25. How to Govern Data Lakes: Measuring the Value of a Governed Data Lake
• Unlocking the value depends on the data lake being broadly usable
• What is the value of R&D? What is the value of avoiding a disaster?
• The context of the data lake is key
– What is the purpose of the data lake?
– What problem will the data lake, as a tool, help you solve?
– How much value does governance (lightweight or not) provide?
• Value is measured in combination with the final use
– AI/Machine Learning
– Agility/Time to Market
– Variety of end users served/capabilities enabled
26. How to Govern Data Lakes: Abstract
• In this webinar, Bob and Evan discussed:
– The relationship between Data Lakes and Data Governance
– Preventing your Data Lake from becoming a Data Swamp
– Governing the Metadata associated with your Data Lake
– Leveraging governed data to provide trustworthy Analytics
– Measuring the value of a governed Data Lake
27. Real-World Data Governance: Contact Information
• Questions and Answers
Join us in the Dataversity Community to continue the conversation.
https://community.dataversity.net/
28. Real-World Data Governance: Contact Information
• Robert S. Seiner
KIK Consulting & Educational Services – KIKconsulting.com
The Data Administration Newsletter – TDAN.com
Post Office Box 112571, Upper St. Clair, Pennsylvania 15241
412.220.9643, 412.220.9644 (Fax)
rseiner@kikconsulting.com
rseiner@tdan.com
@RSeiner @TDAN_com
#RWDG