Data Mesh is a decentralized architecture in which the unit of architecture is a domain-driven data set treated as a product. Each data product is owned by the domain or team that knows the data most intimately, whether by creating it or by consuming and re-sharing it, with specific roles carrying the accountability and responsibility to provide that data as a product. Complexity is abstracted away into a self-serve infrastructure layer so that these products can be created much more easily.
2. e-Commerce: Platform as a service
Diagram: producers (creators of the product offerings) and consumers (buyers of the product offerings) meet on the platform; the platform owner/provider creates the interfaces for the platform.
4. Data Landscape
• Operational data sits in databases behind business capabilities served with microservices, has a
transactional nature, keeps the current state and serves the needs of the applications running the
business.
• Analytical data is a temporal and aggregated view of the facts of the business over time, often
modeled to provide retrospective or future-perspective insights; it trains the ML models or feeds the
analytical reports.
9. Data Mesh: Architecture Principles
• Domain-oriented decentralized data ownership and architecture
• Data as a product
• Federated computational governance
• Self-serve data infrastructure as a platform
10. Data Mesh Addressing Dimensions
• Changes in the data landscape
• Proliferation of sources of data
• Diversity of data use cases and users
• Speed of response to change
11. Data Mesh: Product Owner
• Delivering data as a product
• Objective measures (see the sketch below)
• data quality
• decreased lead time of data consumption
• data user satisfaction
• Those closest to the data are best equipped to manage it capably
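The following is a minimal sketch of how a data product owner might track these objective measures. The class name, field names, and thresholds are illustrative assumptions, not part of any specific data mesh platform.

```python
# Illustrative sketch of a data product owner's objective measures.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProductMetrics:
    rows_total: int
    rows_failing_quality_checks: int
    source_event_time: datetime          # when the fact occurred in the source domain
    available_to_consumers_at: datetime  # when the data product exposed it
    satisfaction_scores: list[float]     # e.g. survey scores from data users, 1-5

    @property
    def data_quality(self) -> float:
        """Share of rows passing the domain's quality checks."""
        return 1 - self.rows_failing_quality_checks / max(self.rows_total, 1)

    @property
    def lead_time_seconds(self) -> float:
        """Lead time of data consumption: source event to availability."""
        return (self.available_to_consumers_at - self.source_event_time).total_seconds()

    @property
    def user_satisfaction(self) -> float:
        """Average satisfaction reported by data users."""
        return sum(self.satisfaction_scores) / max(len(self.satisfaction_scores), 1)


metrics = ProductMetrics(
    rows_total=10_000,
    rows_failing_quality_checks=12,
    source_event_time=datetime(2023, 5, 1, 8, 0),
    available_to_consumers_at=datetime(2023, 5, 1, 8, 30),
    satisfaction_scores=[4.0, 4.5, 3.5],
)
print(metrics.data_quality, metrics.lead_time_seconds, metrics.user_satisfaction)
```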
12. Data Product: Attributes
• Discoverable. Easy to find, e.g., through natural-language search.
• Addressable. Easy to access (once found), assuming the end user has permissions. If they don’t have
permissions, it’s vital they have a means to request access, or work with someone granted access.
• Trustworthy and truthful. Signals around the quality and integrity of the data are essential if people
are to understand and trust it. Data provenance and lineage, for example, clarify an asset’s origin and
past usages, important details for a newcomer to understand and trust that asset. Data observability —
comprising identifying, troubleshooting, and resolving data issues — can be achieved through quality
testing built by teams within each domain.
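Below is a minimal sketch of the kind of domain-owned quality tests a producing team might run before publishing a data product. The checks and the sample `orders` records are illustrative assumptions, not a specific observability tool's API.

```python
# Illustrative domain-owned data quality checks, surfaced as trust signals.
from datetime import datetime, timedelta

orders = [
    {"order_id": "o-1", "amount": 42.0, "created_at": datetime.utcnow()},
    {"order_id": "o-2", "amount": 17.5, "created_at": datetime.utcnow() - timedelta(hours=2)},
]


def check_not_null(rows, field):
    """Every row must carry a value for `field`."""
    return all(row.get(field) is not None for row in rows)


def check_unique(rows, field):
    """`field` must be unique across the data set (e.g. a primary key)."""
    values = [row[field] for row in rows]
    return len(values) == len(set(values))


def check_freshness(rows, field, max_age):
    """The newest record must be no older than `max_age` (timeliness signal)."""
    newest = max(row[field] for row in rows)
    return datetime.utcnow() - newest <= max_age


results = {
    "order_id not null": check_not_null(orders, "order_id"),
    "order_id unique": check_unique(orders, "order_id"),
    "fresh within 24h": check_freshness(orders, "created_at", timedelta(hours=24)),
}
print(results)
```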
13. Data Product: Attributes
• Self-describing. The data must be easily understood and consumed — e.g., through data schemas,
wiki-like articles, and other crowdsourced feedback, like deprecations or warnings.
• Interoperable and governed by global standards. With different teams responsible for data,
governance will be federated (more on this later). But everyone must still abide by a global set of rules
that reflect current regulations, including geography-specific requirements.
• Secure and governed by global access control. Users must be able to access data securely — e.g.,
through RBAC policy definition (sketched below).
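A minimal sketch of role-based access control (RBAC) for a single data product follows. The roles, actions, and policy structure are illustrative assumptions; real platforms typically express such policies declaratively and enforce them through the platform itself.

```python
# Illustrative RBAC policy for one data product.
POLICY = {
    "data_product": "orders.monthly_revenue",
    "roles": {
        "analyst": {"read"},
        "domain_engineer": {"read", "write"},
        "auditor": {"read"},
    },
}


def is_allowed(role: str, action: str, policy: dict = POLICY) -> bool:
    """Return True if `role` may perform `action` on the data product."""
    return action in policy["roles"].get(role, set())


assert is_allowed("analyst", "read")
assert not is_allowed("analyst", "write")
```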
15. Self-Serve Data Infrastructure as a Platform: Persona Benefits
• For producers: Producers need a place to manage their data products (store, create, curate, destroy,
etc.) and make those products accessible to consumers.
• For consumers: Consumers need a place to find data products, within a UI that guides how to use
these products compliantly and successfully.
16. Technology planes of a self-service data mesh
• Plane 1: Data Infrastructure Plane. Addresses networking, storage, access control. Examples include
public cloud vendors like AWS, Azure, and GCP.
• Plane 2: Data Product Developer Experience Plane. This plane uses “declarative interfaces to manage
the lifecycle of a data product” to help developers, for example, build, deploy, and monitor data
products (a minimal spec sketch follows this list). This is relevant to many development environments,
depending on the underlying repository, e.g., SQL for cloud data warehouses.
• Plane 3: Mesh Supervision Plane. This is a consumer-facing place to discover & explore data products,
curate data, manage security policies, etc. While some may call it a data marketplace, others see the
data catalog as the mesh supervision plane. Simply put, this plane addresses the consumer needs
discussed above: discoverability, trustworthiness, etc. And this is where the data catalog plays a role.
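As a rough illustration of the declarative interface mentioned for Plane 2, here is a sketch of a data product specification expressed as plain Python data. Every field name is an illustrative assumption, not a standard schema; the point is that the domain team declares intent and the platform provisions the rest.

```python
# Illustrative declarative data product specification for a developer
# experience plane ("create/deploy/monitor this product").
data_product_spec = {
    "name": "orders.monthly_revenue",
    "domain": "orders",
    "owner": "orders-data-product-owner@example.com",
    "inputs": [
        {"type": "event_stream", "source": "orders.order_placed"},
    ],
    "output_ports": [
        {"type": "table", "format": "parquet", "schema": "monthly_revenue_v1"},
    ],
    "slo": {"freshness_hours": 24, "max_error_rate": 0.01},
    "access": {"default": "deny", "grants": [{"role": "analyst", "action": "read"}]},
}

# The platform, not the domain team, would interpret this spec and provision
# storage, pipelines, access policies and monitoring accordingly.
```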
17. Data Domains
• Domain oriented data decomposition and ownership
• Source oriented domain data
• systems of reality
• truths of their business domain
• raw data at the point of creation
• Consumer oriented and shared domain data
• Distributed pipelines as domain internal implementation
• Service Level Objectives for the quality of the data it provides: timeliness, error rates, etc.
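The following is a minimal sketch of evaluating such Service Level Objectives for a domain's data product. The SLO targets and observed measurements are illustrative assumptions.

```python
# Illustrative SLO evaluation for a data product (timeliness, error rate).
from datetime import timedelta

slo = {
    "max_staleness": timedelta(hours=6),   # timeliness target
    "max_error_rate": 0.005,               # share of records failing validation
}

observed = {
    "staleness": timedelta(hours=2),
    "error_rate": 0.001,
}

violations = []
if observed["staleness"] > slo["max_staleness"]:
    violations.append("timeliness")
if observed["error_rate"] > slo["max_error_rate"]:
    violations.append("error rate")

print("SLOs met" if not violations else f"SLO violations: {violations}")
```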
18. Data Mesh Implementation
• As such, a data mesh implementation “requires a governance model that embraces decentralization
and domain self-sovereignty, interoperability through global standardization, a dynamic topology, and,
most importantly, automated execution of decisions by the platform.” In this way, a tension arises:
which rules are universal, and which are local to a domain? Which practices are universal, and which must be
tailored by domain?
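To make "automated execution of decisions by the platform" concrete, here is a minimal sketch of one globally standardized rule, masking fields marked as sensitive when the consumer sits outside the owning domain, applied uniformly to every data product. The field names, sensitivity labels, and masking rule are illustrative assumptions.

```python
# Illustrative global governance rule executed automatically by the platform.
def apply_global_policy(record: dict, field_sensitivity: dict,
                        consumer_domain: str, owning_domain: str) -> dict:
    """Mask sensitive fields when the consumer is outside the owning domain."""
    if consumer_domain == owning_domain:
        return record
    return {
        field: ("***" if field_sensitivity.get(field) == "pii" else value)
        for field, value in record.items()
    }


sensitivity = {"customer_email": "pii", "order_total": "public"}
record = {"customer_email": "jane@example.com", "order_total": 99.0}

print(apply_global_policy(record, sensitivity,
                          consumer_domain="marketing", owning_domain="orders"))
# {'customer_email': '***', 'order_total': 99.0}
```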
19. Paradigm Shift: A New Language
Pre data mesh governance aspect → Data mesh governance aspect
• Centralized team → Federated team
• Responsible for data quality → Responsible for defining how to model what constitutes quality
• Responsible for data security → Responsible for defining aspects of data security, i.e. data sensitivity levels, for the platform to build in and monitor automatically
• Responsible for complying with regulation → Responsible for defining the regulation requirements for the platform to build in and monitor automatically
• Centralized custodianship of data → Federated custodianship of data by domains
• Responsible for global canonical data modeling → Responsible for modeling polysemes - data elements that cross the boundaries of multiple domains
• Team is independent from domains → Team is made of domain representatives
• Aiming for a well defined static structure of data → Aiming for enabling effective mesh operation embracing a continuously changing and dynamic topology of the mesh
• Centralized technology used by monolithic lake/warehouse → Self-serve platform technologies used by each domain
• Measure success based on number or volume of governed data (tables) → Measure success based on the network effect - the connections representing the consumption of data on the mesh
• Manual process with human intervention → Automated processes implemented by the platform
• Prevent error → Detect error and recover through platform's automated processing
20. Principles underpinning Data mesh
• Domain-oriented decentralized data ownership and architecture: so that the ecosystem creating and consuming data can scale out as the number of sources of data, number of use cases, and diversity of access models to the data increases; simply increase the autonomous nodes on the mesh.
• Data as a product: so that data users can easily discover, understand and securely use high quality data with a delightful experience; data that is distributed across many domains.
• Self-serve data infrastructure as a platform: so that the domain teams can create and consume data products autonomously using the platform abstractions, hiding the complexity of building, executing and maintaining secure and interoperable data products.
• Federated computational governance: so that data users can get value from aggregation and correlation of independent data products - the mesh is behaving as an ecosystem following global interoperability standards; standards that are baked computationally into the platform.
21. Paradigm Shift: A New Language
• Serving over Ingesting
• Discovering and using over Extracting and loading
• Publishing events as streams over flowing data around via centralized pipelines
• Ecosystem of data products over centralized data platform
22. Data Mesh: Architecture Principles
• Domain-oriented decentralized data ownership and architecture
• Data as a product
• Federated computational governance
• Self-serve data infrastructure as a platform
Diagram: as producers and consumers proliferate, the mesh scales out.
Data Landscape
Data flows from the operational data plane to the analytical plane, and back to the operational plane.
The first generation: proprietary enterprise data warehouse and business intelligence platforms; solutions with large price tags that have left companies with equally large amounts of technical debt; technical debt in thousands of unmaintainable ETL jobs, tables and reports that only a small group of specialized people understand, resulting in an under-realized positive impact on the business.
The second generation: big data ecosystem with a data lake as a silver bullet; complex big data ecosystems and long-running batch jobs operated by a central team of hyper-specialized data engineers have created data lake monsters that have at best enabled pockets of R&D analytics; over-promised and under-realized.
The third and current generation data platforms are more or less similar to the previous generation, with a modern twist towards (a) streaming for real-time data availability with architectures such as Kappa, (b) unifying batch and stream processing for data transformation with frameworks such as Apache Beam, and (c) fully embracing cloud-based managed services for storage, data pipeline execution engines and machine learning platforms. It is evident that the third generation data platform addresses some of the gaps of the previous generations, such as real-time data analytics, and reduces the cost of managing big data infrastructure. However, it suffers from many of the underlying characteristics that led to the failures of the previous generations.
https://www.alation.com/blog/data-mesh-vs-data-fabric/
A Data Swamp, in contrast, has little or no organizing system. Data Swamps have no curation: little to no active management throughout the data life cycle, and little to no contextual metadata or Data Governance. As a result, Data Swamps are of little use, or unusable and frustrating.
Data Mesh: A culture
Data mesh inverts this model with domain-driven design and product thinking. Responsibilities are distributed to the people who are closest to the data. These product owners are responsible for delivering data as a product and, as such, they are accountable for objective measures, including “data quality, decreased lead time of data consumption, and general data user satisfaction…”
A data catalog is essential for “Data as a product” capabilities.
Previous data architectures failed to address scale in other dimensions: changes in the data landscape, proliferation of sources of data, diversity of data use cases and users, and speed of response to change. Data mesh addresses these dimensions, founded in four principles: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance. Each principle drives a new logical view of the technical architecture and organizational structure.
By “responsibility,” we mean the manipulation (creation and transformation), maintenance, and distribution of the data to the consumers who need it within the organization. This stands in contrast to the de facto models of data ownership (lakes and warehouses), in which the people responsible for the data infrastructure are also responsible for serving the data.
Data mesh supporters argue that this centralized model is no longer tenable in the expanding data universe of the enterprise. As data landscapes grow more wild, vast, and complex, centralized data ownership has become unwieldy and impossible to scale.
FAIR emphasizes that data must be Findable, Accessible, Interoperable, and Reusable to benefit humans and machines alike.
https://www.nature.com/articles/sdata201618
Code: it includes (a) code for data pipelines responsible for consuming, transforming and serving upstream data - data received from the domain's operational system or an upstream data product; (b) code for APIs that provide access to data, semantic and syntax schema, observability metrics and other metadata; (c) code for enforcing traits such as access control policies, compliance, provenance, etc.
Data and Metadata: well, that's what we are all here for - the underlying analytical and historical data in a polyglot form. Depending on the nature of the domain data and its consumption models, data can be served as events, batch files, relational tables, graphs, etc., while maintaining the same semantic. For data to be usable there is an associated set of metadata including documentation, semantic and syntax declaration, quality metrics, etc.: metadata that is intrinsic to the data, e.g. its semantic definition, and metadata that communicates the traits used by computational governance to implement the expected behavior, e.g. access control policies.
Infrastructure: The infrastructure component enables building, deploying and running the data product's code, as well as storage and access to big data and metadata.
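As a rough illustration of the three structural components described above (code, data and metadata, infrastructure), here is a minimal sketch grouping them into one data product unit. The class and field names are illustrative assumptions, not a reference implementation.

```python
# Illustrative grouping of a data product's code, data/metadata and
# infrastructure declaration into a single unit.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataProduct:
    # Code: pipeline logic (plus, in practice, serving APIs and policy hooks)
    pipeline: Callable[[list[dict]], list[dict]]
    # Data and metadata: the served data plus descriptive/governance metadata
    data: list[dict] = field(default_factory=list)
    metadata: dict = field(default_factory=lambda: {
        "schema": {}, "quality_metrics": {}, "access_policy": {}, "lineage": []})
    # Infrastructure: a declaration of what the platform should provision
    infrastructure: dict = field(default_factory=lambda: {
        "storage": "object_store", "compute": "stream_processor"})

    def serve(self, upstream: list[dict]) -> list[dict]:
        """Run the pipeline over upstream data and expose the result."""
        self.data = self.pipeline(upstream)
        return self.data


product = DataProduct(pipeline=lambda rows: [r for r in rows if r.get("valid", True)])
print(product.serve([{"id": 1, "valid": True}, {"id": 2, "valid": False}]))
```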