This document discusses metadata and the importance of metadata management. It introduces Apache Atlas as an open source platform for metadata management and governance. Key points include:
- Metadata is important for data reuse, analytics, and governance. It provides context and meaning about data.
- Current reality is that metadata is often not well supported or integrated across tools. Apache Atlas aims to provide an open, unified approach.
- Apache Atlas has graduated to a top-level Apache project. It provides a type-agnostic metadata store and interfaces that can be accessed by various tools.
- The vision is for an open ecosystem where metadata is shared and federated across repositories from different vendors and tools.
Apache Atlas Open Innovation Platform
1. Mandy Chessell CBE FREng CEng FBCS
Distinguished Engineer, Master Inventor
Analytics Chief Data Office
mandy_chessell@uk.ibm.com
18th April 2018
Good analytics needs good data and
that needs good metadata
2. Apache Atlas as an open innovation platform for metadata management and governance
Agenda
Why is metadata so important today?
What is the challenge?
Building an open ecosystem
Apache Atlas and the specifics
ODPi Data Governance PMC
Progress report and call to action
3. The perils of reusing data …
Callie Quartile (Data Scientist) uses (1) open data from the local government registrar and (2) data from the employee directory to (3) create a birthday card service for the company.
Diagram labels: Open Data Site, Employee Directory, Data Lake
4. The perils of reusing data …
Diagram labels: Open Data Site, Employee Directory, Data Lake, Callie Quartile (Data Scientist)
Speech bubbles: “Happy Birthday” and “But it’s not my birthday”
Unfortunately the obvious date in the registrar record was the registration of birth date, not the date of birth. Date of birth was not published in the open data.
Callie needed better information about the open data to realise she had the wrong data.
5. Metadata should bring as much information about the data sets to Callie’s data science as is known collectively by the organization.
Employee Directory (columns shown: Name, Band, Job Title)
Data Set Name: Employee Directory
Description: Core attributes describing all employees of OCO pharmaceuticals created from a daily extract from Kenexa.
Owner: Penny Payer
Status:
Last accessed: 6th May 2016
Records: 3488
Last Update: 1st May 2016
Contents: Structure … Contents … Lineage …
Classification Ranges:
Confidentiality: Public, Confidential, Sensitive
Confidence: Authoritative
Retention: Indefinitely
Column: Band (tabs: Characteristics, Lineage, Description)
Description: Position reference number for non-exempt employees. The value ranges from 01 to 06 where 01 is the most senior and 06 is the most junior.
Type: String
Classification: Public
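The catalog entry on this slide can be read as a structured record. The following is a minimal Python sketch, assuming hypothetical class and field names (`DataSetMetadata`, `ColumnMetadata`, `describe`) that are not Apache Atlas types; it only illustrates how a tool could surface a column's documented meaning before the data is reused.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ColumnMetadata:
    """Hypothetical description of one column; not an Apache Atlas type."""
    name: str
    description: str
    data_type: str
    classification: str


@dataclass
class DataSetMetadata:
    """Hypothetical catalog entry for a data set; not an Apache Atlas type."""
    name: str
    description: str
    owner: str
    records: int
    columns: Dict[str, ColumnMetadata] = field(default_factory=dict)

    def describe(self, column: str) -> Optional[str]:
        """Return the documented meaning of a column, or None if it is undocumented."""
        meta = self.columns.get(column)
        return meta.description if meta else None


# Values copied from the catalog entry on the slide.
employee_directory = DataSetMetadata(
    name="Employee Directory",
    description=("Core attributes describing all employees of OCO "
                 "pharmaceuticals created from a daily extract from Kenexa."),
    owner="Penny Payer",
    records=3488,
)
employee_directory.columns["Band"] = ColumnMetadata(
    name="Band",
    description=("Position reference number for non-exempt employees. The value "
                 "ranges from 01 to 06 where 01 is the most senior and 06 is the "
                 "most junior."),
    data_type="String",
    classification="Public",
)

print(employee_directory.describe("Band"))
print(employee_directory.describe("Date of Birth"))  # undocumented column
```

A missing description (the `None` case) is itself a useful signal: it is the situation Callie was in with the registrar's open data.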
6. Different personas need different services
Callie Quartile, Data Scientist
• Find data
• Understand data
• Manage analytics models
Jules Keeper, Chief Data Officer
• Build data strategy
• Define governance program
• Monitor progress
7. Different personas need different services
Faith Broker, HR and Privacy Officer
• Locate personal data
• Ensure protection of personal data
• Understand employee needs
Gary Geeke, IT
• Maintain “safe” IT Infrastructure
• Build and deploy “good” APIs and services
• Locate and resolve issues fast
8. Different personas need different services
Tanya Tidie, Clinical Trials Administrator
• Maintain accurate patient records
• Catalog clinical trials data
• Demonstrate good data management practices
Ivor Padlock, Chief Security Officer
• Understand risks to organization
• Set up protection
• Monitor for suspicious activity
9. Scope of metadata for a data driven organization
• Glossary
• Collaboration
• Governance
• Models and Reference Data
• Metadata Discovery
• Lineage
• Data Assets
• Base Types, Systems and Infrastructure
10. Curation
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
Speech bubbles: “I know” and “I wonder what this means”
11. Scared to share
Faith Broker
Business Team
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
Faith Broker has been doing some simple analysis
on the HR data of the company. She wants to share
this data with Callie Quartile to do some detailed
work. However, she does not want Callie to see the
sensitive personal information in the record.
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 XXXXX XXX 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 XXXXX XXX 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 XXXXX XXX 27 Code St Harlem NY 1 3
Callie Quartile
Data Scientist
12. Using glossary function for semantic processing
Business metadata (glossary terms): Employee, Work Location, Annual Salary, Job Title, Employee Id, Employee Name, Hourly Pay Rate, Manager, Compensation Plan, Sensitive Data
Relationships between terms: HAS-A (six links), IS-A (three links)
Structural metadata for a data store: EMPLOYEE RECORD with fields EMPNAME, EMPNO, JOBCODE, SALARY
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
13. Why do we need metadata?
Metadata enables data to be used outside of the application that created it.
• Analytics and decision making
• New business applications
• Reporting and compliance
Metadata describes the format and content of data allowing people to judge which data set
to use for a new project
• Structure
• Meaning
• Origin
• Valid values and quality
• Usage and ownership
• Regulations and classifications that apply
• <more>
Metadata describes the business context and classification of data allowing automated
governance processes to operate.
14. Today’s reality
Many data platforms do not have metadata support
Proprietary tools support a range of data sources and governance actions
• No vendor supports everything you need, yet each assumes all your tools come from its suite
• Each tool starts “empty” requiring effort to populate metadata
• Each tool operates as if it is the only tool
• No integration/interoperability of metadata repositories from different vendors
Expensive efforts to create an enterprise data catalogue
15. Today’s reality
16. Manual metadata capture
17. Automatic metadata capture
18. What needs to change?
Open and Unified Metadata
19. A new manifesto for metadata and governance
Metadata management must be automated
Metadata management must become ubiquitous
Metadata must become open and remotely accessible
Metadata should be used to drive the governance of data
The discovery, maintenance and use of metadata has to be an integral part of all tools that access, change and move information.
20. Open metadata management ecosystem
Peer-to-peer network of repositories
Metadata stored and managed close to its source
Each repository/tool brings unique value.
Open, extensible metadata structures for metadata exchange and federation, extending coverage of the types of resources that need to be described.
Open source infrastructure sharing cost of development and maintenance between vendors
Support for open standards where available
Diagram labels: Collaboration Space Metadata, Analytics Platform Metadata, Application Metadata, Cloud SaaS platform Metadata, Hadoop Platform Metadata
21. Apache Atlas
http://atlas.apache.org/
Apache Atlas has just graduated to become a top-level project.
It began as an incubator open source project on 5th May 2015 to deliver an open source governance capability focused primarily on the Hadoop platform.
Apache Atlas is designed to localize operational governance to the operating data platform, such as Hadoop.
At its heart is a type-agnostic metadata store that can be accessed through RESTful interfaces.
We see Apache Atlas as the reference implementation for open metadata and governance, for vendors to pick up and use, or to test their integration against.
Being open source allows all vendors to enrich and enhance the standard.
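To make the point about RESTful access concrete, here is a minimal Python sketch that builds a request against the Atlas v2 basic search endpoint (`/api/atlas/v2/search/basic`). The host, port and `admin`/`admin` credentials assume an unmodified local test installation, and the `hive_table` type name and the `employee` query are illustrative assumptions; none of these come from the slides. The request is only constructed and printed, since sending it needs a running server.

```python
import base64
import json
import urllib.parse
import urllib.request

# Assumption: a local Atlas test server on its default port.
ATLAS_BASE_URL = "http://localhost:21000"


def build_basic_search_request(type_name, query, user="admin", password="admin", limit=10):
    """Build (but do not send) a GET request for the Atlas v2 basic search endpoint."""
    params = urllib.parse.urlencode({"typeName": type_name, "query": query, "limit": limit})
    url = f"{ATLAS_BASE_URL}/api/atlas/v2/search/basic?{params}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
        method="GET",
    )


def run_search(request):
    """Send the request and decode the JSON response (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical search: entities of an assumed type whose text matches "employee".
request = build_basic_search_request("hive_table", "employee")
print(request.full_url)
```

Because the interface is plain HTTP and JSON, any tool from any vendor can make the same call, which is the property the slide relies on.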
22. Apache Atlas today
23. Updates to Apache Atlas
Automation
• Capture of metadata from data platforms, data movement engines and data protection engines.
• Exception management and stewardship
Business Value
• Specialized services for key data roles such as CDO, Data Scientist, Developer, DevOps Operator, Asset Owner, Applications
Connectivity
• Metadata Highway offering open metadata exchange, linking and federation between heterogeneous metadata repositories.
24. Taking guidance from existing metadata standards
Well-defined
Complementary
Integrating
Decoupled
https://www.w3.org/TR/vocab-dcat/
25. Instance representations in the graph
26. Open metadata meta-types, types and instances
«relationship» DataContentForDataSet, with ends dataContent (*) and supportedDataSets (*)
«entity» DataSet
createTime : date
modifiedTime : date
«entity» DataStore
«entity» Asset
«entity» GlossaryTerm
«entity» Referenceable
«relationship» SemanticAssignment, with ends assignedElements (*) and meaning (*) and attributes:
description : string
expression : string
status : TermAssignmentStatus
confidence : int
steward : string
source : string
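The three levels named in this slide's title (meta-types, types, instances) can be sketched in a few lines of Python. This is an illustration only, not the Apache Atlas type system API: the classes `EntityDef`, `RelationshipDef`, `EntityInstance` and `TypeRegistry` are hypothetical, the supertype chain (DataSet under Asset under Referenceable) and the `qualifiedName` attribute are assumptions made for the example, and only the type names and the SemanticAssignment attributes are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class EntityDef:
    """Meta-type: the shape every entity type definition follows."""
    name: str
    super_type: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)  # attribute name -> type name


@dataclass(frozen=True)
class RelationshipDef:
    """Meta-type: the shape every relationship type definition follows."""
    name: str
    end1_name: str
    end2_name: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class EntityInstance:
    """Instance: a concrete piece of metadata of a given type."""
    type_name: str
    values: Dict[str, object]


class TypeRegistry:
    """Holds type definitions and validates instances against them."""

    def __init__(self) -> None:
        self._entity_defs: Dict[str, EntityDef] = {}

    def register(self, entity_def: EntityDef) -> None:
        self._entity_defs[entity_def.name] = entity_def

    def all_attributes(self, type_name: str) -> Dict[str, str]:
        """Collect attributes declared on a type and inherited from its supertypes."""
        collected: Dict[str, str] = {}
        current: Optional[str] = type_name
        while current is not None:
            entity_def = self._entity_defs[current]
            for attr, attr_type in entity_def.attributes.items():
                collected.setdefault(attr, attr_type)
            current = entity_def.super_type
        return collected

    def create(self, type_name: str, **values: object) -> EntityInstance:
        unknown = set(values) - set(self.all_attributes(type_name))
        if unknown:
            raise ValueError(f"Unknown attributes for {type_name}: {sorted(unknown)}")
        return EntityInstance(type_name, dict(values))


# Types: the hierarchy and qualifiedName are assumptions for this sketch.
registry = TypeRegistry()
registry.register(EntityDef("Referenceable", attributes={"qualifiedName": "string"}))
registry.register(EntityDef("Asset", super_type="Referenceable"))
registry.register(EntityDef("DataSet", super_type="Asset"))

# A relationship type using end names and some attributes listed on the slide.
semantic_assignment = RelationshipDef(
    "SemanticAssignment", "assignedElements", "meaning",
    attributes={"description": "string", "confidence": "int", "steward": "string"},
)

# Instance: one data set described using the registered types.
data_set = registry.create("DataSet", qualifiedName="employee-directory")
print(data_set)
```

Keeping the store agnostic of any particular type is what lets new types be registered at run time without changing the store itself.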
27. Open metadata type model summary
Areas of the type model: Glossary, Collaboration, Governance, Models and Reference Data, Metadata Discovery, Lineage, Data Assets
Area numbers on the diagram: 4, 3, 1, 5, 2, 6, 7
Base Types, Systems and Infrastructure: area 0
28. Open metadata type model summary
Diagram labels (areas numbered 0 to 7 on the slide):
• Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics)
• Governance Actions and Processes
• Business Objects and Relationships, Taxonomies and Ontologies
• Business Attributes
• Organization
• Teaming Metadata (people profiles, communities, projects, notebooks, …)
• Models and Schemas
• Physical Asset Descriptions (Data stores, APIs, models and components)
• Asset Collections (Sets, Typed Sets, Type Organized Sets)
• Information Views
• Rights Management
• Reference Data
• Feedback Metadata (tags, comments, ratings, …)
• Classification Schemes
• Subject Area Definition
• Campaigns and Projects
• Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
• Information Process Instrumentation (design lineage)
• Connectors
• Basic Types, Infrastructure and Systems
• O-DEF
• O-BDL
Other labels on the diagram: Augmentation (twice), Mapping, Implementation, Classification, Strategy, Rollout, Instrument, Association, Access
29. More detail here …
https://cwiki.apache.org/confluence/display/ATLAS/Building+out+the+Open+Metadata+Typesystem
30. Metadata and governance digital platform
Open Metadata and Governance
Platforms shown: Reporting Platform, ETL Platform, Analytics Platform, Virtualization Platform, Governance Platform, Data Platform
31. Types of tools that may integrate with an open metadata repository
BI and visualization tools
• locating data assets and related information about them; defining
reports and publishing their metadata; viewing lineage
Data Science tool
• wanting to find out about data assets available and manage user
lineage of transformations and analytics models – may also manage
metadata for analytics models
API developer tool
• wanting to understand proper data structures and data meaning to
use for APIs – plus additional governance requirements that need to
be implemented by API because of the data it exchanges.
Counter-fraud tools
• ad hoc analysis of logs and error reports, setting up rules
Curator/owner tool
• for managing the curation of assets, providing access, verifying use of
assets, reviewing discovery results and exceptions, approving change
requests.
Glossary tool
• for subject matter experts and information architects to share
expertise about a particular subject area – may also define structures
and related reference data
Enterprise architect tools
• defining the data landscape and related systems.
DevOps tools
• conformance to policies and standards in development
• metadata capture at deployment
• validation of deployment platform requirements
Data integration engine
• locating appropriate data and component assets, log design lineage,
log operational lineage
Information Virtualisation tools
• locate appropriate data assets, build views and publish them, add
design lineage, log operational lineage
Governance tools
• setting up and monitoring governance program, data quality, …
Stewardship tools
• reviewing assigned exceptions, making data changes and requesting
approval
Information security tools
• setting up data access policies and enforcement
Auditor tools
• view compliance reports and validate policies and policy
implementations
32. Open Metadata Access Services
Project Management
Community Profile
Asset Catalog
Stewardship Action
Information View
Governance Program
Information Process
Subject Area
Connected Asset Discovery
Governance Engine
Information Protection
Developer
Data Platform
Asset Owner
Information Landscape
Data Science
DevOps
Asset Consumer
Information Infrastructure
33. OMAS service instance
Both call API and notifications
34. Inside the server
Open Metadata and Governance (OMAG) Server
Open Metadata Access Services (OMAS)
Open Metadata Repository Services (OMRS)
Connectors: OMRS Topic Connector, OMRS Cohort Registry Store Connector, OMRS Archive Connector, OMRS AuditLog Connector, OMRS Event Mapper Connector, OMRS Repository Connector
Interfaces: OMAS REST APIs and Topics, OMAG Administration REST APIs, OMRS Repository REST APIs
Server Configuration
35. Inside the server
Open Metadata and Governance (OMAG) Server
Open Metadata Access Services (OMAS)
Connectors: OMRS Topic Connector, OMRS Cohort Registry Store Connector, OMRS Archive Connector, OMRS AuditLog Connector, OMRS Event Mapper Connector, OMRS Repository Connector
Interfaces: OMAS REST APIs and Topics, OMAG Administration REST APIs, OMRS Repository REST APIs
Server Configuration
Additional components shown: Administration, Enterprise Repository Services, Local Repository Services, Cohort Services
36. Integration patterns
https://cwiki.apache.org/confluence/display/ATLAS/Integrating+into+the+Open+Metadata+and+Governance+Ecosystem
Diagram labels: IBM Information Governance Catalog, Apache Atlas
37. Caller Pattern
A metadata tool can access the consumer-specific APIs to work with metadata.
The Access Layer handles the calls to metadata repositories connected to the metadata highway.
38. Native Pattern
Native implementation of the open metadata and governance APIs
Apache Atlas is a native implementation of the open metadata and governance APIs.
39. Adapter Pattern
Simple components plug into a repository proxy to connect in an existing metadata repository.
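The adapter idea can be sketched in Python as follows. All names here (`RepositoryConnector`, `LegacyCatalog`, `LegacyCatalogConnector`, `RepositoryProxy`, `find_assets`) are hypothetical and stand in for whatever interfaces the real repository proxy defines; the sketch only shows the shape of the pattern, in which a small plug-in translates between a common interface and an existing repository's own API.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class RepositoryConnector(ABC):
    """Hypothetical common interface that the repository proxy calls."""

    @abstractmethod
    def find_assets(self, name: str) -> List[Dict[str, str]]:
        """Return matching assets in the common (open) format."""


class LegacyCatalog:
    """Stand-in for an existing metadata repository with its own API and row format."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, str, str]] = [
            ("EMP_DIR", "Employee Directory", "Penny Payer"),
        ]

    def lookup(self, text: str) -> List[Tuple[str, str, str]]:
        return [row for row in self._rows if text.lower() in row[1].lower()]


class LegacyCatalogConnector(RepositoryConnector):
    """The 'simple component': maps the legacy API onto the common interface."""

    def __init__(self, catalog: LegacyCatalog) -> None:
        self._catalog = catalog

    def find_assets(self, name: str) -> List[Dict[str, str]]:
        return [
            {"id": row[0], "name": row[1], "owner": row[2]}
            for row in self._catalog.lookup(name)
        ]


class RepositoryProxy:
    """Presents the existing repository to the rest of the ecosystem."""

    def __init__(self, connector: RepositoryConnector) -> None:
        self._connector = connector

    def find_assets(self, name: str) -> List[Dict[str, str]]:
        return self._connector.find_assets(name)


proxy = RepositoryProxy(LegacyCatalogConnector(LegacyCatalog()))
print(proxy.find_assets("employee"))
```

The existing repository stays unchanged; only the connector knows its native API, so supporting another repository means writing another connector rather than another proxy.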
40. Plug-in Pattern
Open Connector Framework (OCF)
• Connectors to data, analytics etc
Open Discovery Framework (ODF)
• Metadata discovery services
Governance Action Framework (GAF)
• Stewardship services for triage and remediation of exceptions
41. IBM Unified Governance
42. Simple cohort
Cohort A
Chief Data Office
Data Lake
Systems of Record
43. Multiple Cohorts
Cohort A, Cohort B
Diagram labels: Chief Data Office, Data Lake (twice), Systems of Record (twice), Mobile Apps, Marketing
44. First server
45. Establishing contact
46. Federated queries
47. Caching metadata for availability and performance
48. ODPi - co-creation with practitioners
• Compliance assistance and certification for vendors
• Subject matter experts sharing best practices and co-creating content packs
https://github.com/odpi/data-governance
49. How ODPi Helps
Data Governance Professionals
• Your governance program is based on established practices and definitions
• Allows a broader range of tools in your organization
• Automated governance processes protect and manage your data
Vendors
• Your metadata offerings will deliver value faster as they tap into metadata collected by other vendors’ tools.
• ODPi packages extend your metadata system’s and tools’ capabilities
• Conformance tests minimize your effort in being compliant with key standards and regulations.
• Customers have increased confidence in your tools and services due to ODPi certification.
50. Summary
Big data is creating new opportunities and requirements that need new types of systems. Data Lakes are just one part of this story.
Metadata is critical to make the best use of this data for the widest range of
scenarios.
Most organizations use tools and platforms from many vendors.
Open standards have had limited take-up.
Can we use open source to create a digital platform that allows vendors to take
advantage of metadata from a broader ecosystem?
• Open Metadata and Governance defines the standards
• Apache Atlas provides the reference implementation
• ODPi helps to build the ecosystem
51. Call to action – how can you help?
Direct contribution to the Apache Atlas and/or ODPi Data Governance projects.
• There are many features that still need to be developed.
Encouraging your vendors/partners and projects internal to your organization
to embrace the Open Metadata and Governance standards to grow the
ecosystem of data and processing that is assured by metadata and governance
capability.
52. https://cwiki.apache.org/confluence/display/ATLAS/Atlas+Projects
53. Questions?
Editor's notes
Business metadata describes the data that the business needs, what it means and how it should be classified and protected.
Structural metadata describes how the data is actually stored and labelled in the data store.
The linkage between the business and technical metadata allows our technology to switch between these two perspectives. For example:
• A request for data expressed in business terminology can be translated into a query for data from a data store.
• An integration engine copying data into a sandbox can discover which fields the business classifies as sensitive and then mask these values dynamically.
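Both examples can be sketched in a few lines of Python. The column names and glossary terms are taken from the glossary slide, but the column-to-term mapping, the choice of "Annual Salary" as the sensitive term, the sample values and the function names are all assumptions made for illustration; a real engine would look these links up in the metadata repository rather than in hard-coded dictionaries.

```python
from typing import Dict, List

# Assumed linkage between structural metadata (columns) and business metadata (terms).
COLUMN_TO_TERM: Dict[str, str] = {
    "EMPNAME": "Employee Name",
    "EMPNO": "Employee Id",
    "JOBCODE": "Job Title",
    "SALARY": "Annual Salary",
}

# Assumed classification: which glossary terms the business marks as sensitive.
SENSITIVE_TERMS = {"Annual Salary"}


def columns_for_term(term: str) -> List[str]:
    """Translate a business term into the physical columns that hold it."""
    return [column for column, mapped in COLUMN_TO_TERM.items() if mapped == term]


def build_query(term: str, table: str = "EMPLOYEE") -> str:
    """Turn a request phrased in business terminology into a data store query."""
    columns = columns_for_term(term)
    if not columns:
        raise KeyError(f"No column is linked to the term {term!r}")
    return f"SELECT {', '.join(columns)} FROM {table}"


def mask_record(record: Dict[str, str]) -> Dict[str, str]:
    """Mask every value whose column is linked to a sensitive glossary term."""
    masked = {}
    for column, value in record.items():
        if COLUMN_TO_TERM.get(column) in SENSITIVE_TERMS:
            masked[column] = "X" * len(value)
        else:
            masked[column] = value
    return masked


# Illustrative record; the pairing of values to columns is assumed.
record = {
    "EMPNAME": "Callie Quartile",
    "EMPNO": "328080",
    "JOBCODE": "Data Scientist",
    "SALARY": "56944",
}
print(build_query("Job Title"))
print(mask_record(record))
```

Note that the masking rule never names a column: it follows the classification on the glossary term, so reclassifying a term changes the behaviour for every data store linked to it.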
AUTOMATED: metadata is created by the application at the same time as the data is created, in a standard manner that is easily consumable by everyone with the necessary permissions.
For example, for a photograph: the device that took the picture, the name of the picture, the settings the picture was taken at, the location geotag of the picture, etc. All of this is automatic and is done at the time the data is created.
The maintenance of metadata must be automated to scale to the sheer volumes and variety of data involved in modern business.
Metadata management must become ubiquitous in cloud platforms and large data platforms, such as Apache Hadoop, so that the processing engines on these platforms can rely on its availability and build capability around it.
Metadata access must become open and remotely accessible so that tools from different vendors can work with metadata located on different platforms. This implies unique identifiers for metadata elements, some level of standardization in the types and formats for metadata and standard interfaces for manipulating metadata.
Metadata should be used to drive the governance of data and create a business friendly logical interface to the data landscape.
Wherever possible, discovery and maintenance of metadata has to be an integral part of all tools that access, change and move information.