ODPi Egeria provides a framework for open metadata management, supporting many use cases in the governance of data in Data Lakes – as described in the Egeria webinar on 2nd June: “Data Lake Design with Egeria”. As described in that webinar, Egeria operates across the Data Lake, without needing centralization of metadata from the different tools into a central tool or repository.
Metadata frequently describes relationships between things such as Assets, Schemas, Glossaries and Terms, and these relationships form graphs. Egeria is distributed in nature, enabling you to see a federated view of the metadata contained in multiple tools and metadata repositories. As a result, the discrete graphs naturally federate to form a distributed graph. In this session, we’ll cover the Open Metadata Repository Services (OMRS) layer that enables Egeria to operate across this distributed graph.
1. Egeria and Graphs
Graham Wallis, August 2020
Graham Wallis is an open-source developer and maintainer on the ODPi Egeria project. He has worked with
graph-related technologies for about 5 years, so he doesn’t have all the answers but hopes you find this
presentation interesting and useful.
4. • In a commercial setting, metadata is used to describe:
• database records and schemas, files and file formats, documents, models, …
• systems, applications, processes such as ETL, archiving, analytics, …
• business concepts as glossaries of terms and their semantic assignments
• In typical commercial organizations:
• the data landscape is vast and distributed
• data is dispersed across multiple data lakes managed by different parts of an organization
• multiple tools from different vendors are used to load, access and manage the data
• multiple tools are used to analyze the data
Commercial metadata and governance
6. • Organizations need a business-friendly logical interface to the data landscape. This implies that
the organization develops a common business vocabulary or glossary.
• Organizations need governance of data to be driven by the metadata, requiring that the metadata
is accurate and up-to-date.
• The maintenance of metadata must be automated to scale to the volumes and variety of data
involved in modern business.
• The metadata must be available across different tools and platforms so that processing engines
can build capability around it.
• Wherever possible, discovery and maintenance of metadata must be an integral part of tools that
access, change and move information.
• Metadata access must become open and remotely accessible so that tools from different vendors
can work with metadata located on different platforms.
• This implies unique identifiers for metadata elements, some level of standardization in the types
and formats for metadata and standard interfaces for accessing and manipulating metadata.
Commercial metadata and governance
8. • ODPi Egeria is an open source project dedicated to making metadata open and automatically
exchanged between tools and data platforms
• Egeria provides an Apache 2.0 licensed platform to enable users and vendors to create an open
ecosystem for metadata
• Egeria arose from several years' work by Mandy Chessell (IBM), Ferd Scheepers (ING Bank) and others,
on data lakes, data governance & common information models
• Egeria is hosted by the Linux Foundation ODPi project (Open Data Platform Initiative): egeria.odpi.org
• The code is on GitHub: github.com/odpi/egeria
• The Egeria community includes IBM, ING Bank, Manta and SAS plus contributions and interest from
other organizations and individuals.
Egeria Project & Community
10. Egeria enables exchange of metadata between tools from different vendors
[Figure: open and unified metadata shared between Development, DevOps and Data Science tools]
11. Egeria Servers and Cohorts
[Figure: Egeria Servers, Egeria Repositories, an External Tool/Repository and Applications arranged around two Cohorts]
• A server may have a repository or may support a given tool or external repository.
• A server may join multiple cohorts.
13. Graphs in Metadata
[Figure: structural metadata for a data store (an EMPLOYEE RECORD with fields EMPNAME, EMPNO, JOBCODE and SALARY) shown alongside business metadata (Employee, Work Location, Annual Salary, Job Title, Employee Id, Employee Name, Hourly Pay Rate, Manager Compensation Plan), connected by HAS-A and IS-A relationships, with a Sensitive Data classification]
• The interconnected nature of metadata forms a graph
• The business concepts associated with the data form a graph of terms and classifications
14. Graphs in Metadata
• Different tools or databases give rise to graphs at both business and technical levels
15. Querying across graphs…
• Enterprise integration and queries require that we can query across graphs and
between business and technical metadata
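A query that crosses business and technical metadata can be sketched as a walk over three small edge lists. This is only an illustration: the relationship names HAS-COLUMN and MEANS, and the specific links between columns and terms, are assumptions made for the example and are not taken from the slides or from the Egeria type system.

```python
# Edges are (source, relationship, target). All names here are illustrative.
technical = [
    ("EMPLOYEE RECORD", "HAS-COLUMN", "EMPNAME"),
    ("EMPLOYEE RECORD", "HAS-COLUMN", "SALARY"),
]
business = [
    ("Employee", "HAS-A", "Employee Name"),
    ("Employee", "HAS-A", "Annual Salary"),
]
# Hypothetical links from technical metadata to business terms.
semantic = [
    ("EMPNAME", "MEANS", "Employee Name"),
    ("SALARY", "MEANS", "Annual Salary"),
]


def columns_for_concept(concept):
    """Find the columns whose meaning is a term that the concept HAS-A."""
    # Step 1: stay in the business graph and collect the concept's terms.
    terms = {t for s, r, t in business if s == concept and r == "HAS-A"}
    # Step 2: cross into the technical graph through the semantic links.
    columns = {s for s, r, t in semantic if r == "MEANS" and t in terms}
    return sorted(columns)


print(columns_for_concept("Employee"))
```

The point of the sketch is that the query starts from a business concept and ends at technical columns, so it only works if both graphs and the links between them are reachable from one place.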
16. Parallels between graphs…
• The graph of artifacts in a Discovery Analysis Report mirrors the graph of schema elements
17. • As seen from the foregoing examples (of different tools, business and technical metadata, discovery analysis
reports) there are many graph-like structures in metadata
• Egeria is therefore based on graphs and graph-like approaches; it includes a graph repository and
graph-based tooling
• The Open Metadata Types form graphs - an entity type inheritance graph and a graph of the possible
relationship types for an entity type
• We also see graphs in glossary structure (glossary, terms, categories) as well as in the semantic assignment of
glossary terms to metadata instances
• Metadata instances (entities, relationships and classifications) are organized as graphs and can be queried
using graph traversals
Graphs in Egeria
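The entity type inheritance graph mentioned above can be sketched as a child-to-parent map. The four type names below are a small, simplified illustration and not a complete or authoritative copy of the Open Metadata Types; the helper functions are hypothetical.

```python
# Illustrative subset of an entity type inheritance graph: child -> parent.
SUPERTYPE = {
    "Asset": "Referenceable",
    "DataSet": "Asset",
    "GlossaryTerm": "Referenceable",
}


def supertypes(type_name):
    """Walk the inheritance edges from a type up to the root."""
    chain = []
    while type_name in SUPERTYPE:
        type_name = SUPERTYPE[type_name]
        chain.append(type_name)
    return chain


def is_subtype_of(type_name, ancestor):
    """True when type_name is the ancestor itself or inherits from it."""
    return type_name == ancestor or ancestor in supertypes(type_name)


print(supertypes("DataSet"))
```

A check like `is_subtype_of` is what lets a search for one type also return instances of its subtypes.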
18. • Within the Egeria integration UI:
• The Type Explorer can be used to visualize entity type inheritance and entity type relationship graphs
• The Repository Explorer can be used to explore graphs of entities and relationships across repositories
• The Admin UI shows the deployed topology of Egeria platforms, servers and cohorts
Egeria UI graph visualizations
19. • Egeria can transparently federate metadata from multiple repositories, giving rise to a distributed graph
• Entities in different repositories can be related by a relationship in either repository or a further repository
• Entities and relationships in different repositories can be queried and traversed as if they were collocated
• Egeria’s federation capability avoids the need to move or copy metadata
• Ownership remains with the current owner
• There is no duplication, or risk of updates being applied to a copy of the metadata
• Egeria can create a local reference copy of a remote instance, as a locally cached copy, but ownership of the
metadata remains with the tool and repository that created it. Updates are only permitted on the owner’s
original, not on the copies
• When an Egeria user accesses a remote instance, the Egeria server will register interest in the remote
instance
• If the remote instance is modified or deleted, any registered Egeria servers receive events, delivered to the
access services that triggered the interest
• Ownership of an instance can be transferred if necessary
Egeria federation (a distributed graph)
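The idea that entities in different repositories can be traversed as if they were collocated can be sketched with three in-memory repositories. This is a toy model under stated assumptions: the `Repository` class and `federated_neighbors` function are hypothetical and do not represent the OMRS implementation, which works through connectors and events.

```python
class Repository:
    """A toy metadata repository identified by its metadata collection id."""

    def __init__(self, collection_id):
        self.collection_id = collection_id
        self.entities = {}       # guid -> entity name
        self.relationships = []  # (guid_a, relationship type, guid_b)


def federated_neighbors(guid, repositories):
    """Find entities related to guid, wherever the relationship or entity lives."""
    found = []
    for repo in repositories:
        for a, rel_type, b in repo.relationships:
            if guid not in (a, b):
                continue
            other = b if a == guid else a
            # The far end may be stored in any member repository.
            for holder in repositories:
                if other in holder.entities:
                    found.append((rel_type, holder.entities[other], holder.collection_id))
    return found


repo1, repo2, repo3 = Repository("repo-1"), Repository("repo-2"), Repository("repo-3")
repo1.entities["col-1"] = "SALARY"
repo2.entities["term-1"] = "Annual Salary"
# The relationship is held by a further repository.
repo3.relationships.append(("col-1", "SemanticAssignment", "term-1"))

print(federated_neighbors("col-1", [repo1, repo2, repo3]))
```

Nothing is copied between the repositories here: each entity is read from the repository that owns it.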
20. Egeria distributed graph model
[Figure: a Database Column entity on OMAG Server 1 and a Glossary Term entity on OMAG Server 2]
• A pair of entities may be stored in separate servers
21. Egeria distributed graph model – using reference copies
[Figure: OMAG Server 1 holds a Database Column and a Reference Copy of the Glossary Term, linked by a Meaning relationship; OMAG Server 2 holds the original Glossary Term]
• One entity could be replicated to the other server, as a ‘reference copy’
• The original Glossary Term on OMAG Server 2 is still the authoritative instance; the copy cannot be updated
• A relationship could be defined between the local DB column and the reference copy of the Glossary Term
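The read-only nature of a reference copy can be sketched by recording each entity's home metadata collection and refusing updates anywhere else. The `Entity` and `Server` classes and their method names are hypothetical, chosen for this illustration only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entity:
    guid: str
    name: str
    home_collection_id: str  # the repository that owns the original


class Server:
    def __init__(self, collection_id):
        self.collection_id = collection_id
        self.store = {}

    def add_entity(self, guid, name):
        """Create an entity that is homed in this server's repository."""
        entity = Entity(guid, name, self.collection_id)
        self.store[guid] = entity
        return entity

    def save_reference_copy(self, entity):
        """Cache a copy of an entity that is owned by another repository."""
        if entity.home_collection_id == self.collection_id:
            raise ValueError("entity is already homed here")
        self.store[entity.guid] = entity

    def update_name(self, guid, name):
        """Updates are only permitted on the owner's original."""
        entity = self.store[guid]
        if entity.home_collection_id != self.collection_id:
            raise PermissionError("reference copies are read-only")
        self.store[guid] = replace(entity, name=name)
        return self.store[guid]


server1, server2 = Server("server-1"), Server("server-2")
term = server2.add_entity("term-1", "Annual Salary")
server1.save_reference_copy(term)
```

Calling `server1.update_name("term-1", ...)` raises `PermissionError`, while the same call on `server2` succeeds because it holds the original.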
22. Egeria distributed graph model – using reference copies
[Figure: OMAG Server 1 holds the Database Column and OMAG Server 2 holds the Glossary Term; OMAG Server 3 holds reference copies of both, linked by a Meaning relationship]
• Alternatively, both entities could be replicated to a third server, as reference copies
• The originals are still the authoritative instances
• A relationship could be defined between the local reference copies
23. Egeria distributed graph model – using entity proxies
[Figure: OMAG Server 3 holds a Meaning relationship between Entity Proxies for the Database Column on OMAG Server 1 and the Glossary Term on OMAG Server 2]
• Instead of replication, the third server could relate the original entities using entity proxies
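An entity proxy can be thought of as a lightweight stand-in that identifies a remote entity without copying its full content. The sketch below assumes a proxy carries only a guid, a type name and the home collection id; the field names and the `relate` helper are hypothetical and are not the OMRS class definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityProxy:
    """A stub that identifies an entity stored in another repository."""
    guid: str
    type_name: str
    home_collection_id: str


@dataclass(frozen=True)
class Relationship:
    type_name: str
    end1: EntityProxy
    end2: EntityProxy
    home_collection_id: str  # the server that owns the relationship


def relate(server_id, type_name, end1, end2):
    """Create a relationship on server_id between two proxied entities."""
    return Relationship(type_name, end1, end2, server_id)


column = EntityProxy("col-1", "DatabaseColumn", "server-1")
term = EntityProxy("term-1", "GlossaryTerm", "server-2")
meaning = relate("server-3", "Meaning", column, term)
print(meaning.home_collection_id, meaning.end1.home_collection_id, meaning.end2.home_collection_id)
```

The relationship is owned by the third server, while both ends remain owned by the servers that created them.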
25. Egeria OMRS Repositories
[Figure: Search, Open Metadata Access Services and Open Metadata Repository Services]
• Egeria includes a choice of metadata repositories, which can be used as additional metadata stores that can plug
functional gaps between other tools and repositories and can provide local access
• One of the Egeria repositories is a graph repository, which lends itself to the types of queries we saw earlier
26. Egeria Open Metadata Repository Services (OMRS)
• The OMRS defines a protocol and a set of connectors
• The Enterprise Connector performs cohort-wide operations, including
issuing queries to the cohort. When metadata is replicated from another
server, it can use the local connector and repository to cache it for
availability and performance
• The Local Connector performs local operations and provides a default
Event Mapper that enables events relating to local operations to be sent
to the cohort
• The Repository Connector interfaces to a specific repository – and
optionally, may be accompanied by a custom Event Mapper
• Egeria provides two built-in repositories and there are connectors to
other repositories
• The interface to a repository connector is the MetadataCollection API,
described on the next slide
[Figure: OMRS Enterprise Connector, OMRS Local Connector & Event Mapper, OMRS Repository Connector and Repository, with the MetadataCollection API and the Cohort]
27. The OMRSMetadataCollection interface
• The interface to an Egeria repository is the OMRSMetadataCollection interface
• It includes groups of operations:
• Group 1: Identification of the metadata repository - metadataCollectionId
• Group 2: Type definitions (types, attributes) - add, find, get, remove, …
• Group 3: Find instances (entities, relationships) - get, find, graph-queries, …
• Group 4: Maintain instances (entities, relationships) - addEntity, deleteEntity, …
• Group 5: Change control information (entities, relationships) - reIdentify, reHome, …
• Group 6: Maintenance of reference (replica) copies – save, purge, refresh,…
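The six groups of operations can be sketched as an abstract interface with one representative method per group. The real OMRSMetadataCollection is a Java class with many more methods and richer parameters; the Python method names and signatures below are hypothetical simplifications, and `InMemoryCollection` is a toy implementation written for this example.

```python
from abc import ABC, abstractmethod


class MetadataCollection(ABC):
    @abstractmethod
    def get_metadata_collection_id(self): ...              # Group 1: identification

    @abstractmethod
    def add_type_def(self, type_name): ...                 # Group 2: type definitions

    @abstractmethod
    def find_entities_by_property(self, key, value): ...   # Group 3: find instances

    @abstractmethod
    def add_entity(self, guid, type_name, properties): ... # Group 4: maintain instances

    @abstractmethod
    def re_identify_entity(self, old_guid, new_guid): ...  # Group 5: change control

    @abstractmethod
    def save_entity_reference_copy(self, guid, entity): ...  # Group 6: reference copies


class InMemoryCollection(MetadataCollection):
    def __init__(self, collection_id):
        self._id = collection_id
        self._types = set()
        self._entities = {}

    def get_metadata_collection_id(self):
        return self._id

    def add_type_def(self, type_name):
        self._types.add(type_name)

    def find_entities_by_property(self, key, value):
        return [g for g, e in self._entities.items() if e["properties"].get(key) == value]

    def add_entity(self, guid, type_name, properties):
        if type_name not in self._types:
            raise ValueError(f"unknown type: {type_name}")
        self._entities[guid] = {"type": type_name, "home": self._id,
                                "properties": dict(properties)}

    def re_identify_entity(self, old_guid, new_guid):
        # Give an existing instance a new guid, keeping its content.
        self._entities[new_guid] = self._entities.pop(old_guid)

    def save_entity_reference_copy(self, guid, entity):
        if entity["home"] == self._id:
            raise ValueError("a reference copy must come from another repository")
        self._entities[guid] = entity


repo = InMemoryCollection("repo-1")
repo.add_type_def("GlossaryTerm")
repo.add_entity("term-1", "GlossaryTerm", {"displayName": "Annual Salary"})
print(repo.find_entities_by_property("displayName", "Annual Salary"))
```

Because callers depend only on the abstract interface, a different repository can be substituted without changing the code that uses it.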
28. Egeria Local Graph Repository
• The Egeria distribution includes a persistent repository and a non-persistent repository
• The persistent repository is a graph repository built on JanusGraph, an open-source graph database project, hosted by the
Linux Foundation
• http://janusgraph.org
• http://github.com/janusgraph/janusgraph
• The built-in graph repository provides an OMAG Server with a persistent metadata store and is built using Egeria’s ‘plugin’
pattern
• The graph repository can store instances of metadata owned by the local server
• It can also store reference copies of metadata instances replicated to the local server
• It also supports relationship instances that refer to entity proxy instances
• Other graph databases are available, and Egeria’s pluggable connector architecture enables the creation of repository
connectors for different databases.
• The Conformance Test Suite provides a set of automated tests that can be run against a repository to assess whether it
correctly implements the Egeria types and interfaces
29. Anatomy of the local graph repository
[Figure: an OMAG Server containing the OMAS access services, the OMRS Enterprise Connector, the OMRS Local Connector & Event Mapper and the OMRS Graph Connector, with OMRS topics (in and out) to the Cohort; the Graph Metadata Store uses Apache TinkerPop and JanusGraph for persistence and search, with JanusGraph Management]
30. Graph Repository configurations
• The first release of the Egeria Graph Repository used BerkeleyDB and Lucene as embedded persistence
and indexing backends. This provides a relatively simple quick-start configuration, especially good for
development and testing and sufficient for some production uses.
• In production it may be desirable (or essential) to use a different persistence backend (e.g. Cassandra) or
indexing backend (e.g. Elastic).
• ING Bank added to the configuration of the Graph Repository to enable the use of (remote) Cassandra and
Elastic services.
• Discussions have started about work to add a remote JanusGraph Server configuration in order to provide
an HA option.
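The difference between the embedded and remote configurations can be sketched as two sets of JanusGraph options. The option names and backend values are JanusGraph configuration settings; the directory paths and hostnames are hypothetical, and how the Egeria Graph Repository actually supplies these options to JanusGraph is not shown in the slides and is not implied here.

```python
# Embedded quick-start backends: BerkeleyDB for persistence, Lucene for indexing.
# The directory paths are hypothetical.
EMBEDDED = {
    "storage.backend": "berkeleyje",
    "storage.directory": "./data/graph/berkeley",
    "index.search.backend": "lucene",
    "index.search.directory": "./data/graph/lucene",
}

# Remote backends: Cassandra (via CQL) and Elasticsearch.
# The hostnames are hypothetical.
REMOTE = {
    "storage.backend": "cql",
    "storage.hostname": "cassandra.example.com",
    "index.search.backend": "elasticsearch",
    "index.search.hostname": "elastic.example.com",
}


def backends(config):
    """Return the (persistence, indexing) backend pair named by a configuration."""
    return config["storage.backend"], config["index.search.backend"]


print(backends(EMBEDDED))
print(backends(REMOTE))
```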
31. Graph Repository components
• GraphOMRSRepositoryConnector - implements the open connector framework interface
• GraphOMRSRepositoryConnectorProvider – implements the mechanism for brokering a connector
• GraphOMRSMetadataCollection – top level interface supporting type and instance operations
• GraphOMRSMetadataStore – implements the MetadataCollection using a graph database
• GraphOMRSGraphFactory – creation, schema, indexing - encapsulates JanusGraph-specifics
• Mappers – convert between OMRS objects and graph vertices and edges
• GraphOMRSEntityMapper
• GraphOMRSRelationshipMapper
• GraphOMRSClassificationMapper
• Plus various utility classes – error codes, audit logging, constants and utility methods
https://github.com/odpi/egeria/
See open-metadata-implementation/adapters/open-connectors/repository-services-connectors/
open-metadata-collection-store-connectors/graph-repository-connector
32. To use the Egeria Graph Repository
• Configure the OMAG Server with repository-mode = ‘local-graph-repository’
• e.g. HTTP POST http://localhost:8080/open-metadata/admin-services/users/{username}/servers/{servername}/local-repository/mode/local-graph-repository
• Start the OMRS instance in the server
• e.g. HTTP POST http://localhost:8080/open-metadata/admin-services/users/{username}/servers/{servername}/instance
• If using the embedded configuration of Berkeley DB for persistence and Lucene for indexing,
when OMRS starts, the graph repository auto-creates a JanusGraph database – including:
• Persistence backend
• Search backend
• Graph schema
• Search indexes
• If using alternative backends for persistence or indexing, ensure that they are correctly configured
and available before starting the OMAG Server.
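The two HTTP POST calls above could be scripted along the following lines. This is a minimal sketch using only the Python standard library: the helper names are hypothetical, the requests are sent with an empty body, and authentication, TLS and error handling are left out. The user name and server name are placeholders.

```python
import urllib.request


def admin_url(platform, user, server, suffix):
    """Build an admin-services URL in the shape used by the two calls above."""
    return (f"{platform}/open-metadata/admin-services/users/{user}"
            f"/servers/{server}/{suffix}")


def post(url):
    """Send an HTTP POST with an empty body and return the status code."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status


def configure_and_start(platform, user, server):
    # Step 1: select the local graph repository for this server.
    post(admin_url(platform, user, server,
                   "local-repository/mode/local-graph-repository"))
    # Step 2: start the server instance, which starts OMRS.
    post(admin_url(platform, user, server, "instance"))


# Placeholder values; this only prints the URL and does not contact a platform.
print(admin_url("http://localhost:8080", "alice", "server1", "instance"))
```

Calling `configure_and_start("http://localhost:8080", "alice", "server1")` would issue both requests in order against a running platform.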
33. Graph Schema
The MetadataCollection interface is the formal interface to an Egeria repository.
Whilst it is possible to look at the graph directly (e.g. using the Gremlin console),
please don’t rely on the schema: it is likely to evolve
Type data:
• The Graph Repository does not store type definitions
• It delegates all type operations to the Repository Content Manager
Instance data:
• The Egeria Graph Repository stores instance data, using a JanusGraph schema that has:
• vertices for entities and classifications
• edges for relationships and classifiers
37. Local instances, reference copies and proxies
• The graph contains one vertex per entity – whether the entity is local, a reference copy or a proxy
• If the entity has an associated classification, the classification is stored as a vertex, with an edge from the
entity vertex to the classification vertex
• The graph contains one edge per relationship – whether the relationship is local or a reference copy
• Reference Copies
• The metadataCollectionId core attribute is set to the ‘guid’ of the home repository
• Entity Proxy objects
• Each entity instance has a vertex property of type Boolean, to indicate whether the instance is a
proxy
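How one vertex per entity can distinguish local instances, reference copies and proxies is sketched below. The property names (`guid`, `typeName`, `metadataCollectionId`, `isProxy`) are assumptions made for this illustration; the slides advise against relying on the actual graph schema, so this is not a description of it.

```python
def entity_vertex(guid, type_name, home_collection_id, is_proxy=False):
    """Represent an entity as a vertex with a handful of properties."""
    return {
        "guid": guid,
        "typeName": type_name,
        "metadataCollectionId": home_collection_id,  # the home repository
        "isProxy": is_proxy,                         # Boolean proxy flag
    }


def describe(vertex, local_collection_id):
    """Classify a vertex from the point of view of the local repository."""
    if vertex["isProxy"]:
        return "proxy"
    if vertex["metadataCollectionId"] != local_collection_id:
        return "reference copy"
    return "local"


local = entity_vertex("col-1", "DatabaseColumn", "repo-1")
copy = entity_vertex("term-1", "GlossaryTerm", "repo-2")
proxy = entity_vertex("term-2", "GlossaryTerm", "repo-2", is_proxy=True)

print([describe(v, "repo-1") for v in (local, copy, proxy)])
```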
38. Metadata Collection ‘graph-query’ methods
• There are 4 sub-graph query methods:
• getRelatedEntities() - optional
• Returns the entity and its immediate neighbors
• getEntityNeighborhood() - optional
• Returns the entity and its neighbors up to the depth specified by
the ‘level’ parameter
• getLinkingEntities() - optional
• Returns the relationships and intermediate entities that connect
the specified pair of entities
• getRelationshipsForEntity() - mandatory
• Returns relationships associated with entity, optionally filtered
by relationship type and status
[Figure: an entity neighborhood traversed to level = 2]
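The effect of the ‘level’ parameter on a neighborhood query can be sketched as a breadth-first traversal that stops at a given depth. This is a generic graph algorithm written for illustration, not the Graph Repository's implementation; the function name and the edge-list input format are assumptions.

```python
from collections import deque


def entity_neighborhood(edges, start, level):
    """Return {entity: depth} for every entity within `level` hops of start."""
    # Relationships are traversed in both directions.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == level:
            continue  # do not expand beyond the requested depth
        for neighbor in sorted(adjacency.get(node, ())):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth


edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(entity_neighborhood(edges, "A", 2))
```

With level = 1 the same call returns only the start entity and its immediate neighbors.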
39. Graph Repository – supported functions
• The Graph Repository supports most of the OMRS MetadataCollection API, including:
• Save and purge of reference copies
• Use of entity proxies
• Delete and restore as well as purge – delete is a soft, restorable delete; purge is permanent
• Re-type of instances
• Re-identify of instances
• Re-home of instances
• The four ‘graph queries’ – described on the previous slide
• The ‘find’ methods – find..ByProperty, find..ByPropertyValue, findEntityByClassification
• The Graph Repository does not (yet) support:
• Historic queries – find methods that specify an asOfTime parameter
• Undo of previous instance updates
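The distinction between a soft, restorable delete and a permanent purge can be sketched with a status flag on each instance. The `InstanceStore` class, its method names and the status values are hypothetical and are used only to illustrate the behaviour described above.

```python
class InstanceStore:
    def __init__(self):
        self._instances = {}  # guid -> {"status": ..., "properties": ...}

    def add(self, guid, properties):
        self._instances[guid] = {"status": "ACTIVE", "properties": dict(properties)}

    def delete(self, guid):
        """Soft delete: the instance is kept but marked as deleted."""
        self._instances[guid]["status"] = "DELETED"

    def restore(self, guid):
        """Bring a soft-deleted instance back."""
        instance = self._instances[guid]
        if instance["status"] != "DELETED":
            raise ValueError("only a deleted instance can be restored")
        instance["status"] = "ACTIVE"

    def purge(self, guid):
        """Permanent removal: the instance can no longer be restored."""
        del self._instances[guid]

    def find_active(self):
        return sorted(g for g, i in self._instances.items() if i["status"] == "ACTIVE")


store = InstanceStore()
store.add("a", {"name": "first"})
store.add("b", {"name": "second"})
store.delete("a")
print(store.find_active())
```

After `store.restore("a")` both instances are active again, whereas after `store.purge("a")` a later restore fails because nothing is left to restore.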