This document provides an agenda and overview for a lunch and learn session on how data virtualization can enable a data mesh architecture. The session will discuss what a data mesh is, how it addresses challenges with centralized data management, and how data virtualization tools allow domains to create and manage their own data products while maintaining governance. It highlights how data virtualization maintains domain autonomy, provides self-serve capabilities, and enables federated computational governance in a data mesh. The presentation will demonstrate Denodo's data virtualization platform and discuss why a data lake alone may not be sufficient for a data mesh, as data virtualization offers more flexibility and reuse.
1. DENODO LUNCH & LEARN
26 OCTOBER
WHY DATA MESH NEEDS
DATA VIRTUALIZATION
2. Presenters for this Session
Chris Day
Director, APAC Sales Engineering, Denodo
Elaine Chan
Regional Vice President, Sales, ASEAN & Korea, Denodo
3. Agenda
1. What is a Data Mesh
2. What is Data Virtualization (DV)
3. How can DV Enable a Data Mesh
4. Implementation Strategies
5. Why a Data Lake alone is not Enough
6. Q&A
7. Next Steps
5. What is a Data Mesh
§ The Data Mesh is a new architectural paradigm for data management.
§ Proposed by the consultant Zhamak Dehghani in 2019.
§ It moves from a centralized data infrastructure managed by a single team to a distributed organization.
§ Several autonomous units (domains) are in charge of managing and exposing their own “Data Products” to the rest of the organization.
§ Data Products should be easily discoverable, understandable and accessible to the rest of the organization.
6. What Challenges is a Data Mesh Trying to Address?
1. Lack of domain expertise in centralized data teams
§ Centralized data teams are disconnected from the business
§ They need to deal with data and business needs they do not always understand
2. Lack of flexibility of centralized data repositories
§ Data infrastructure of big organizations is very diverse and changes frequently
§ Modern analytics needs may be too diverse to be addressed by a single platform: one size never fits all
3. Slow data provisioning and response to changes
§ Requires extracting, ingesting and synchronizing data in the centralized platform
§ Centralized IT becomes a bottleneck
7. How?
§ Organizational units (domains) are responsible for managing and exposing their own data
§ Domains understand better how the data they own should be processed and used
§ Gives them autonomy to use the best tools to deal with their data, and to evolve them when needed
§ Results in shorter and fewer iterations until business needs are met
§ Removes dependency on fully centralized data infrastructures
§ Removes bottlenecks and accelerates changes
§ Introduces new concepts to address risks like creating data silos, duplicated effort and lack of unified governance
§ These will be explored in the following slides
8. Data as a Product
§ To ensure that domains do not become isolated data silos, the data exposed by the different domains must be:
§ Easily discoverable
§ Understandable
§ Secured
§ Usable by other domains
§ The level of trust and quality of each dataset needs to be clear.
§ The processes and pipelines used to generate the product (e.g. cleansing and deduplication) are internal implementation details, hidden from consumers.
9. Self-serve Data Platform
§ Building, securing, deploying, monitoring and managing data products can be complex
§ Not all domains will have resources to build this infrastructure
§ Possible duplication of effort across domains
§ Self-Serve: while operated by a global data infrastructure team, the platform allows the domains to create and manage the data products themselves.
§ The platform should be able to automate or simplify tasks such as:
§ Data integration and transformation
§ Security policies and identity management
§ Exposure of data APIs
§ Publication and documentation in a global catalog
10. Federated Computational Governance
§ Data products created by the different domains need to interoperate with each other and be combined to solve new needs.
§ e.g. to be joined, aggregated, correlated, etc.
§ This requires agreement about the semantics of common entities (e.g. customer, product), about the formats of field types (e.g. SSNs, entity identifiers, ...), about the addressability of data APIs, etc.
§ Managed globally and, when possible, automatically enforced
§ This is why the word ‘computational’ is used in naming this concept
§ Security must be enforced globally according to the applicable regulations and policies.
12. Data Virtualization – A Data Fabric Layer
“Data Virtualization creates a data abstraction layer by connecting, gathering, and transforming data silos to support real-time and near-real-time insights”
– Forrester Research, Inc., “The Forrester Wave: Enterprise Data Fabric, Q2 2020”
[Diagram: the data virtualization layer sits between disparate data sources and data consumers, in three steps:
1. CONNECT to disparate data sources – databases & warehouses, cloud/SaaS applications, big data, NoSQL, web, XML, Excel, PDF, Word... (from more structured to less structured) – via SQL, MDX, web services, big data APIs, and web automation and indexing, producing normalized views of disparate data.
2. COMBINE related data into views – discover, transform, prepare, improve quality, integrate.
3. CONSUME in business applications – enterprise applications, reporting, BI, portals, ESB, mobile, web, users – for analytical and operational use, over multiple protocols and formats, with query/search/browse, request/reply and event-driven access, and secure delivery; share, deliver, publish, govern, collaborate.]
13. Data Virtualization: Essential Capabilities
A consistent, flexible view of information across any consuming application:
§ Data Abstraction: decoupling applications and data usage from data sources and infrastructure
§ Zero Replication, Zero Relocation: physical data remains where it is
§ Real-Time Information: most reporting and analytical tools can easily connect for real-time data
§ Self-Service Data Marketplace: a dynamic data catalogue for self-service data discovery and data services available in the virtualization layer
§ Centralized Metadata, Security & Governance: manage access across all data assets in the virtualization layer for enterprise data security; supports dynamic data anonymization
§ Location-agnostic Architecture: for hybrid and multi-cloud acceleration
15. Easy Creation of Data Products
§ A modern DV tool like Denodo provides access to any underlying data system along with advanced data modeling capabilities.
§ This allows domains to quickly create data products from any data source, or by combining multiple data sources, and to expose them in business-friendly form.
§ No coding is required to define and evolve data products.
§ Iterating through multiple versions of the data products is also much faster thanks to reduced data replication.
§ Data products are automatically accessible via multiple technologies: SQL, REST, OData, GraphQL and MDX.
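The idea of a data product as a virtual view over several sources can be sketched in a few lines. This is an illustrative simulation only, not the Denodo API: the source names, fields and the join logic are all assumptions, and the point is that the combined, business-friendly view is computed at query time, with no data copied into a new store.

```python
# Two "domain sources": a SaaS CRM extract and an operational billing database.
# All names and fields are illustrative.
crm_customers = [
    {"customer_id": 1, "name": "Acme Pte Ltd", "country": "SG"},
    {"customer_id": 2, "name": "Globex KK", "country": "JP"},
]
billing_accounts = [
    {"customer_id": 1, "balance": 1200.50},
    {"customer_id": 2, "balance": 0.0},
]

def customer_360_view():
    """A 'data product': joins the two sources on customer_id at query time,
    so no data is replicated ahead of consumption."""
    balances = {row["customer_id"]: row["balance"] for row in billing_accounts}
    for customer in crm_customers:
        yield {**customer, "balance": balances.get(customer["customer_id"])}

rows = list(customer_360_view())
print(rows[0])
```

In a DV platform this view would be defined graphically or in SQL and then exposed automatically over SQL, REST, OData, GraphQL or MDX; the sketch only captures the "combine at query time, replicate nothing" idea.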
16. Maintains the Autonomy of Domains
§ Domains are not conditioned by centralized, company-wide data sources (data lake, data warehouse). Instead, they are allowed to leverage their own data sources.
§ E.g. domain-specific SaaS applications or data marts
§ They can also leverage centralized stores when they are the best option:
§ E.g. use a centralized data lake for ML use cases
§ The domains can also autonomously decide to evolve their data infrastructure to suit their specific needs.
§ E.g. migrate some function to a SaaS application
17. Provides Self-serve Capabilities
Discoverability and Documentation
§ Includes a Data Catalog which allows business users and other data consumers to quickly discover, understand and get access to the data products.
§ Automatically generates documentation for the data products using standard formats such as OpenAPI
§ Includes data lineage and change impact analysis functionalities for all data products
Performance and Flexibility
§ Includes caching and query acceleration capabilities out of the box, so even data sources not optimized for analytics can be used to create data products.
Provisioning
§ Automatic scaling using cloud/container technologies. This means that, when needed, the infrastructure supporting certain data products can be scaled up/down while still sharing common metadata across domains.
18. Enables Federated Computational Governance
§ The semantic layers built in the virtual layer can enforce standardized data models to represent the federated entities which need to be consistent across domains (e.g. customer, products).
§ Can import models from modeling tools to define a contract that the developer of the data product must comply with
§ Automatically enforces unified security policies, including data masking/redaction.
§ E.g. automatically mask SSN with *** except last 4 digits, in all data products except for users in the HR role
§ Data products can also be easily combined and can be used as a basis to create new data products.
§ The layered structure of virtual models allows creating components which can be reused by multiple domains to create their data products.
§ For instance, there may be virtual views for generic information about company locations, products, ...
§ Having a unified data delivery layer also makes it easier to automatically check and enforce other policies such as naming conventions or API security standards.
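The SSN masking policy mentioned above can be sketched as a small function. In a real deployment the DV layer applies such a policy centrally to every data product rather than in consumer code; the role name and SSN format here are illustrative assumptions.

```python
def mask_ssn(ssn: str, user_roles: set[str]) -> str:
    """Mask all but the last 4 digits of an SSN, except for users in the
    HR role (illustrative role name), mirroring the policy on the slide."""
    if "HR" in user_roles:
        return ssn  # HR users see the full value
    return "***-**-" + ssn[-4:]  # everyone else sees only the last 4 digits

print(mask_ssn("123-45-6789", {"ANALYST"}))  # ***-**-6789
print(mask_ssn("123-45-6789", {"HR"}))       # 123-45-6789
```

Because the policy lives in one place, every data product inherits it automatically instead of each domain re-implementing masking.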
20. A Data Mesh in a Virtualization Cluster
[Diagram: a virtualization cluster with three virtual schemas – Common Domain, Event Management, Human Resources – exposing data products such as Customer, Product, Location, Employee and Event, built over operational databases, EDWs, data lakes, files and SaaS APIs, and consumed via SQL, REST, GraphQL and OData]
1. Each domain is given a separate virtual schema. A common domain may be useful to centralize data products shared across domains.
2. Domains connect their data sources.
3. Metadata is mapped to relational views. No data is replicated.
4. Domains can model their data products. Products can be used to define other products.
5. For execution, products can be served directly from their sources, or replicated to a central location, like a lake.
6. A central team can set guidelines and governance to ensure interoperability.
7. Products can be accessed via SQL or exposed as an API. No coding is required.
8. Infrastructure can easily scale out in a cluster.
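Steps 1 and 4 – one virtual schema per domain, with products composed from other products – can be sketched as a tiny registry. All names are illustrative, and this is not how Denodo models schemas internally; the point is only that a product in one schema can be defined on top of a product in the common schema.

```python
# One virtual schema per domain (step 1); the common schema holds shared products.
mesh = {
    "common": {},
    "event_management": {},
    "human_resources": {},
}

def define_product(domain: str, name: str, definition):
    """Register a data product (a callable view) in a domain's virtual schema."""
    mesh[domain][name] = definition

# A shared product in the common domain...
define_product("common", "location", lambda: [{"site": "SG1"}, {"site": "JP1"}])

# ...reused to define a domain product (step 4: products compose).
define_product(
    "event_management", "events_by_site",
    lambda: [{"site": s["site"], "events": 0} for s in mesh["common"]["location"]()],
)

print(mesh["event_management"]["events_by_site"]())
```

Because products are just views over other views, evolving the common `location` product automatically flows through to every domain product built on it.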
23. A Data Lake Based Data Mesh
§ Data Lake vendors claim that you can build a Data Mesh using the infrastructure of a Data Lake / Lakehouse.
§ This approach tries to introduce self-service capabilities in this infrastructure for domains to create their own data products based on data in the lake.
§ Domains may also have independent clusters/buckets for their products.
24. Challenges of That Approach
§ Many domains have specialized analytic systems they would like to use.
§ e.g. domain-specific data marts
§ The data lake may not be the right engine for every workload in every domain.
§ Domains are forced to ingest their data into the lake and go through the whole process of creating and managing the required ingestion pipelines, ELT transformations, etc. using the data lake technology.
§ Data needs to be synchronized, pipelines operated, etc.
§ This can be a slow process and, in addition, it forces domains to bring staff with those complex and scarce skills into the team.
§ If the domains are not able to acquire those skills, they need to rely on the centralized team and we are back to square one
25. How Does DV Improve That?
§ With DV, domains have the flexibility to reuse their own domain-specific data sources and infrastructure.
§ The flexibility to use domain-specific infrastructure has several advantages:
1. It allows domains to reuse and adapt the work they have already done to present data in formats close to the actual business needs. This will typically be much faster
2. The domain probably already has the required skills for this infrastructure
3. Domains can choose best-of-breed data sources which are especially suited to their data and processes
§ Some domains can still choose to go through the data lake process for their products, but this does not force all domains to do it for all their products.
§ The virtual layer offers built-in ways to ingest data into the lake and keep it in sync
§ In-lake or off-lake is a choice, not an imposition
26. Additional Benefits of a DV Approach
1. Reusability: DV platforms include strong capabilities to create and manage rich, layered semantic models which foster reuse and expose data to each type of consumer in the form most suitable for them
2. Polyglot consumption: DV allows data consumers to access data using any technology, not only SQL. For instance, self-describing REST, GraphQL and OData APIs can be created with a single click. Multidimensional access based on MDX is also possible
3. Top-down modelling: you can create ‘interface data views’ which set ‘schema contracts’ that developers of data products need to comply with.
§ This helps to implement the concept of federated computational governance.
4. Data marketplace: a ready-to-use data catalog which can act as a data marketplace for the data products created by the different domains
5. Broad access: even in companies that have built a company-wide, centralized data lake, there is typically a lot of domain-specific data that is not in the lake. DV allows incorporating all of that data into the data products
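The ‘schema contract’ idea in point 3 can be sketched as a check that a data product's rows match a declared interface. The contract fields and types below are illustrative assumptions, not Denodo's interface-view mechanism, which enforces this at the metadata level rather than per row.

```python
# An 'interface' the developer of a customer data product must comply with.
# Field names and types are illustrative.
CUSTOMER_CONTRACT = {"customer_id": int, "name": str, "country": str}

def complies_with(row: dict, contract: dict) -> bool:
    """True if the row exposes exactly the contracted fields, each with
    the contracted type."""
    return (set(row) == set(contract)
            and all(isinstance(row[field], typ) for field, typ in contract.items()))

good = {"customer_id": 7, "name": "Acme", "country": "SG"}
bad = {"customer_id": "7", "name": "Acme"}  # wrong type, missing field

print(complies_with(good, CUSTOMER_CONTRACT))  # True
print(complies_with(bad, CUSTOMER_CONTRACT))   # False
```

A contract like this is what lets governance be ‘computational’: compliance is checked automatically rather than negotiated case by case.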
28. Key Takeaways
1. Data Mesh is a new paradigm for data management and analytics
§ It shifts responsibilities towards domains and their data products
§ It aims to reduce bottlenecks, improve speed, and guarantee quality
2. Data lakes alone fail to provide all the pieces required for this shift
3. Data Virtualization tools like Denodo offer a solid foundation to implement this new paradigm.
§ Easy learning curve so that domains can use it
§ Can leverage domain infrastructure or direct domains towards a centralized repository
§ Simple yet advanced graphical modeling tools to define new products
§ Full governance and security controls
32. Building a Logical Data Fabric using Data Virtualization
Chris Day
Director, APAC Sales Engineering, Denodo
Elaine Chan
Regional Vice President, Sales, ASEAN & Korea, Denodo
REGISTER NOW
denodo.link/DLL2110
ASEAN Virtual Lunch & Learn | 23-Nov | 1pm SGT