Introduction
In recent years, there has been a significant push towards decentralized data organizations where different domains (e.g. lines of business) are partially or fully responsible for exposing their own data for analytics. Perhaps the most popular example is the Data Mesh architectural paradigm for data management proposed by Zhamak Dehghani in 2019.
The main objective of these paradigms is to speed up the creation and availability of trusted, high-quality, compliant data to share across the enterprise by training business professionals to produce this data rather than always relying on centralized IT teams who may not be able to keep pace with demand.
These paradigms are founded on the decentralization and distribution of data responsibility to the people who are closest to the data. This approach arises from the insight that centralized, monolithic data architectures suffer from some inherent problems:
- A lack of business understanding in the data team. Centralized data teams need to deal with data they do not fully understand to solve business problems that they also do not completely understand. This forces continuous back-and-forth between the data team and the business groups, slowing down the process and affecting the quality of the final result.
- The lack of flexibility of centralized data platforms. Centralizing all data into a single platform may be problematic, because the needs of big organizations are too diverse to be addressed by a single platform: one size never fits all.
- Slow data provisioning and response to changes. Every new data request from a business unit requires ingesting the data into the centralized system and changing the pipelines at all stages of the platform. This makes the system rigid and brittle when changes happen.
A more detailed explanation by Zhamak Dehghani of these principles in the context of the Data Mesh can be found in the article Data Mesh Principles and Logical Architecture.
This document details design guidelines and best practices that can be used to implement many of these decentralization principles using Denodo. For a high-level overview of how Data Virtualization can help to implement decentralized organizations, we recommend reading Why Data Mesh Needs Data Virtualization on the Data Virtualization blog.
Data Mesh and Decentralization Principles
We will use the four main principles in the Data Mesh paradigm as a way to organize the rest of the document. We will describe each principle and explain guidelines to implement it in Denodo. Notice however that our discussion does not apply exclusively to the Data Mesh; it applies to any decentralized organization where data responsibilities are distributed among different teams.
The four main principles in the Data Mesh paradigm are:
- Domain oriented decentralized data ownership and architecture.
  - Each Domain or Organizational Unit is in charge of managing and exposing its own “Data Products” to the organization.
  - Removes the dependency on fully centralized data infrastructures, which often become bottlenecks, and thereby accelerates change.
  - Gives Domains the flexibility to choose their own data integration approach.
- Data as a Product.
  - Avoid data silos by publishing feasible, valuable and usable data from a specific Domain.
  - Data Products should be easily discoverable, understandable and accessible to the rest of the organization.
- Self-service data infrastructure as a platform.
  - Avoid complexity and duplication of effort across Domains.
  - Allows Domains to build, deploy, publish and manage Data Products in a self-serve manner.
  - Operated by a central team, but the central team does not develop the products.
- Federated computational data governance.
  - Ensure interoperability between Domains.
  - Common semantics and conventions for shared entities.
  - When needed, apply global security and governance policies which are common to all Domains.
In the next sections, we will review each one of them in more detail.
Data Product
A Data Product is an autonomous, read-optimized, standardized data unit containing at least one Domain dataset created to satisfy user needs.
The Data Mesh paradigm details the characteristics that a Data Product should have, like discoverability, addressability, etc. In the following sections we will go into detail for each one of these characteristics, and will explain how they can be implemented using Denodo’s platform.
Discoverable
The Data Product has to be easily discoverable, and users should be able to explore the available Data Products. Information such as “source of origin, owners, runtime information such as timeliness, quality metrics, sample datasets, and most importantly information contributed by their consumers such as the top use cases and applications enabled by their data” should be available.
This can be implemented in Denodo by adding descriptive metadata to the view(s) that compose the Data Product.
Addressable
“The unique address must follow a global convention that helps users to programmatically and consistently access all Data Products. The data product must have an addressable aggregate root that serves as an entry to all information about a data product, including its documentation, service-level objectives, and the data it serves.”
This can be implemented in Denodo by using the RESTful Web Service and the Data Catalog.
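As a minimal sketch of addressability, a Data Product published through Denodo's RESTful Web Service is reachable at a predictable URL. The host, port, database and view names below are hypothetical, and the path layout should be verified against your own installation:

```python
def data_product_address(host: str, port: int, database: str, view: str) -> str:
    """Build the conventional address of a Data Product exposed through the
    Denodo RESTful Web Service (path layout assumed; verify against your
    installation)."""
    return f"https://{host}:{port}/denodo-restfulws/{database}/views/{view}"

# Hypothetical example: the 'iv_sales' Data Product in the 'marketing' domain VDB.
url = data_product_address("denodo.example.com", 9443, "marketing", "iv_sales")
print(url)
# https://denodo.example.com:9443/denodo-restfulws/marketing/views/iv_sales
```

Because every product follows the same convention, consumers can locate any Data Product programmatically from just its domain database and view name.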
Understandable
Each Data Product provides semantically coherent data with a specific meaning. It should offer data samples, example code and feedback from its users.
This can be implemented in Denodo by using the Denodo Data Catalog and the Denodo Notebook.
Trustworthy and truthful
The Data Product is truthful and represents the facts of the business correctly. It has to guarantee and communicate its Service Level Objectives (SLOs): interval of change, timeliness, completeness, statistical shape of data, lineage, operational qualities like performance and availability, and precision and accuracy over time.
This can be implemented in Denodo by using Data Lineage and acceleration techniques.
Natively accessible
“A Data Product needs to make it possible for various data users to access and read its data in their native mode of access.” For example, Data Analysts want to explore data in spreadsheets or query languages, data scientists want to use data frames, and developers expect a real-time stream of events or pull-based APIs.
This can be implemented in Denodo by using the different access interfaces provided or by exporting the data.
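As an illustration of one native access mode, a data scientist could consume a Data Product's JSON output and turn it into rows ready for a data frame. The payload shape below, with rows wrapped in an "elements" array, is an assumption about the service's output and should be adjusted to what your endpoint actually returns:

```python
import json

def rows_from_response(payload: str) -> list:
    """Extract result rows from a JSON payload. The 'elements' wrapper is an
    assumption about the response shape; adjust to your service's output."""
    return json.loads(payload).get("elements", [])

# Hypothetical payload returned by a published Data Product endpoint.
payload = '{"name": "iv_sales", "elements": [{"region": "EMEA", "revenue": 120}]}'
rows = rows_from_response(payload)
print(rows[0]["region"])  # EMEA
```

The same underlying view can simultaneously serve JDBC/ODBC consumers, spreadsheet users and API clients, which is the point of this characteristic.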
Interoperable
Correlate data across Domains and stitch them together by linking, combining and aggregating Data Products. To facilitate this, it is necessary to standardize consistent field types, polysemic identifiers, Data Product global addresses, common metadata fields, schema linking (the ability to reuse schemas from other products), data linking (the ability to link to data in other products) and schema stability (schema evolution that preserves backwards compatibility).
Combining Data Products and tracing their dependencies (lineage/change impact) is straightforward with Denodo.
Valuable on its own
A Data Product should carry a dataset that is valuable and meaningful on its own, without being joined and correlated with other Data Products.
This is a conceptual characteristic with no relation to any particular implementation or technology.
Independently developed, tested and deployed
Data Products have independent lifecycles and are built and deployed by independent Domain teams.
This characteristic can be implemented in Denodo using the Solution Manager.
Secure
“Data users access a data product securely and in a confidentiality-respecting manner.” Security policies should be written so that they can be versioned, automatically tested, deployed, observed, and computationally evaluated and enforced: access control, encryption, confidentiality levels, data retention, and regulations and agreements.
This can be implemented in Denodo through its access control and security policy features; there are more details in the section “Federated Computational Data Governance”.
Data Product Types
The Data Mesh paradigm distinguishes between three types of Domain Data which typically correspond to different Data Product types. The definitions are:
- Source-Domain Data Products: This is the native Data Product. It contains analytical data reflecting the business facts generated by the operational systems.
- Aggregate-Domain Data Products: These Data Products are derived from the Source-Domain Data Products. They represent analytical data aggregated from multiple Domains, often through complex queries.
- Consumer-Aligned Data Products: Data Products that fit the needs of one or multiple specific use cases.
The next section explains how these different types of products can be reflected in Denodo.
Implementation Best Practices
We will define implementation best practices by following the Data Mesh principles we have described in the previous section.
- Data as a Product.
  - How to implement a Data Product in Denodo.
- Domain oriented decentralized data ownership and architecture.
  - How to organize virtual models in layers.
  - Roles involved.
  - Development and deployment process.
  - Version Control System integration.
- Self-service data infrastructure as a platform.
  - Data Catalog capabilities for self-service.
- Federated Computational Data Governance.
  - Interoperability between Data Products.
  - Common semantics.
  - Global Security Policies and Governance.
Data as a Product
A Data Product is represented in Denodo as an Interface view implemented by an Integration View that can be a combination of data from disparate Data Sources.
Domain Data Product Owners publish their own Domain Data Products to be accessed across the enterprise. We will define the characteristics of the Domain Data Product Owner role in the next section.
Those Data Products could be used to define other Data Products.
Data Products will be published in the Domain Virtual Database. Platform Product Owners can grant METADATA privileges to the allusers role over the Interface Views that represent a Data Product in order to make their metadata accessible.
Restrictions over published Data Products could be applied through Global Security Policies.
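The value of representing a Data Product as an Interface View is that consumers depend on a stable, published schema while the implementing integration view can change underneath. This contract idea can be sketched in plain Python (the field names are illustrative, not a Denodo API):

```python
# Sketch of the contract behind an Interface View: the published schema is
# fixed, and any implementing integration view must conform to it.
EXPECTED_SCHEMA = {"customer_id", "segment", "lifetime_value"}  # hypothetical fields

def conforms_to_interface(row: dict) -> bool:
    """Check that a result row exposes exactly the published interface fields."""
    return set(row) == EXPECTED_SCHEMA

row = {"customer_id": 1, "segment": "retail", "lifetime_value": 350.0}
print(conforms_to_interface(row))  # True
```

In Denodo this conformance is enforced by the Interface View definition itself; the sketch only makes the decoupling between published schema and implementation explicit.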
Domain oriented decentralized data ownership and architecture
Standardized Layers
Creating Data Products is a responsibility of each Domain or Organizational Unit so each Domain will develop their Data Products in one or more Virtual Databases (VDBs).
Our recommendation is to define standardized layers, but, depending on your context, more Virtual Databases can be defined if necessary. Each layer will have one or more VDBs. For example, in the Integration layer of the following diagram there will be at least one Virtual Database for each Domain.
Connectivity Layer
It contains the Data Sources configuration. It can be managed by the Platform Product Owners, who will create the Data Sources available to the different Domains. It can also be managed by the Domain Data Product Owners, who will configure the Data Sources they need for building their Data Products. This allows Domains to maintain their independence, but it requires deeper technical knowledge.
If Domain Data Product Owners and Developers do not have enough technical knowledge, the connection to the data sources may be created by the Platform Product Owners upon request.
Another consideration is whether different Domains will need to access similar data sources, because having a single data source defined in the Connectivity Layer allows the platform to delegate cross-domain queries and apply additional optimizations. In this case, the Platform Product Owner or the Domain Data Product Owner should revoke the CREATE DATASOURCE privilege from the Domain Data Product Developers. Data Sources would then be created and maintained by the Platform Product Owner in the Connectivity Layer, so the Domain Data Product Owners and the Domain Data Product Developers only need to create and use the Base Views they need.
Consider using Interface Views which are implemented by the Base Views from the Connectivity Layer in order to protect Domain implementations from source volatility.
Integration Layer
Each Domain has its own VDB, which contains the integration views created for building its own Data Products and the Data Products themselves.
Integration Views are private to each Domain, so they shouldn’t be accessible by other Domains.
Data Products could be accessible by other Domains and Business Users. A Data Product should be an Interface View that is implemented by an integration view.
Data Products Layer
Another layer could be added in order to publish the Data Products. Each Domain will have their own Data Products VDB where they will put the Data Products they want to expose to the enterprise. This way, a Domain will add to this VDB the elements it wants to make accessible, but transformations will remain private in the Integration layer.
This approach makes it easier to manage the permissions granted to other Domains. For example, grant the METADATA privilege over all elements in the Domain Data Products VDB to the allusers role, so that all users logged in to Virtual DataPort can see the metadata of the Data Products.
Roles
For each Domain there should be, at least, the following roles:
Domain Data Product Owner
Responsible for creating, serving and evangelizing Data Products. Depending on the number and complexity of the subdomains and the Data Products, there can be one or multiple Data Product Owners in the same Domain. Can add or edit the metadata associated with the Data Products.
A Domain Data Product Owner should have the corresponding characteristics configured in Denodo.
Domain Data Product Developer
Responsible for building and maintaining the transformation logic that generates the Data Products. Data Product Developers “work closely with their collaborating application developers in defining the domain data semantic, mapping data from application context (data on the inside) to the data product context (data on the outside)”.
A Domain Data Product Developer should have the corresponding characteristics configured in Denodo.
The Platform Product Team (not Domain-specific) should have the following roles:
Platform Product Owner
Responsible for facilitating the prioritization of the platform services and for designing and building the experience of the users of the platform.
A Platform Product Owner should have the corresponding characteristics configured in Denodo.
Technical Validator
The goal of this role is to review the views created by the Domains from a technical perspective and, if necessary, apply optimizations or refactoring.
A Technical Validator should have the corresponding characteristics configured in Denodo.
Deployment of Data Products
Using the Solution Manager, each Domain can take care of deploying their own elements from the development environment to production. We recommend the following approach:
- By enabling the option Authenticate with current user credentials for creating revisions, revisions will be created using the credentials of the user logged in to the Solution Manager, so users will only have access to their own elements from their own Domain.
- Grant QA-environment-specific DEPLOY privileges to the Domain Data Product Developer roles. Users with the DEPLOY privilege over an environment can do the following:
  - Access basic information of a specific environment, its clusters and servers, in read-only mode.
  - Create a revision to be deployed and validated in this environment.
  - Edit or remove their own revisions.
- Grant QA-environment-specific DEPLOY ADMIN privileges to the Domain Data Product Owner role, in order to:
  - Validate, create and deploy revisions.
- Version the deployed Data Products.
  - Version Data Products using different version numbers in their names (e.g. <DATA_PRODUCT>_V1, <DATA_PRODUCT>_V2…) or maintain two different versions of each Data Product (e.g. <DATA_PRODUCT> and <DATA_PRODUCT>_NEW), so that the new version becomes the current version after a predefined time period.
- Automate the verification of the development standards through the Denodo Testing Tool or a similar tool.
- Additionally, this process can be automated and integrated with external lifecycle management tools, like Jenkins. To enable that, the Solution Manager provides a REST API with broad functionality.
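The versioning convention above can be sketched with a small helper. This is not part of any Denodo tooling, just an illustration of the `<DATA_PRODUCT>_V<n>` naming scheme:

```python
def versioned_name(data_product: str, version: int) -> str:
    """Apply the <DATA_PRODUCT>_V<n> naming convention described above."""
    return f"{data_product}_V{version}"

def next_version(current: str) -> str:
    """Given a versioned name, return the name of the next version."""
    base, _, v = current.rpartition("_V")
    return versioned_name(base, int(v) + 1)

# Hypothetical Data Product name.
print(versioned_name("CUSTOMER_360", 1))  # CUSTOMER_360_V1
print(next_version("CUSTOMER_360_V1"))    # CUSTOMER_360_V2
```

Keeping both versions published side by side lets consumers migrate at their own pace before the old version is retired.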
Version Control System Integration
Each Domain can maintain its own repository in a Version Control System to manage and track development.
If there are several developers in a single Domain, it may be advisable to create a copy of the Domain Virtual Database for each developer. Domain Data Product Developers will work on their own Virtual Database for the project, connected to the Version Control System with a personal user. The Domain Data Product Owners should keep the main Domain Virtual Database updated with the changes committed by the Domain Data Product Developers from their own copies.
Self-service data infrastructure as a platform
The Denodo Data Catalog can be used as a Self-Service Data Platform by the Domain Users in order to search for Data Products from other Domains, browse their relationships and reduce duplication.
It also provides several strategies to help a Business User or a Domain User discover valuable data resources: searching relevant terms in the catalog using both the metadata of the Data Products and their actual data, organizing Data Products in categories and browsing them, receiving automatic recommendations of data assets based on user interactions, inspecting which Data Products are important for other users in the company, and navigating through the relationships defined between Data Products.
Federated Computational Data Governance
Ensure interoperability and global policies
For the majority of use cases, in order to get value in the form of higher-order datasets, insights or machine intelligence, these independent Data Products need to interoperate: it must be possible to correlate them, create unions, find intersections, or perform other graph or set operations on them at scale.
Through Data Virtualization it is possible to combine data from multiple Data Products by applying operations such as joins, unions, intersections…
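The kind of cross-domain correlation involved can be illustrated in plain Python: two Data Products from different Domains are joined on a shared (polysemic) identifier. The datasets and field names are hypothetical, and in practice Denodo performs this join virtually at query time:

```python
# Two hypothetical Data Products from different Domains sharing a customer id.
sales = [{"customer_id": 1, "revenue": 120}, {"customer_id": 2, "revenue": 80}]
support = [{"customer_id": 1, "open_tickets": 3}]

def join_on_customer(left, right):
    """Inner-join two row lists on the shared 'customer_id' identifier."""
    index = {r["customer_id"]: r for r in right}
    return [
        {**l, **index[l["customer_id"]]}
        for l in left if l["customer_id"] in index
    ]

print(join_on_customer(sales, support))
# [{'customer_id': 1, 'revenue': 120, 'open_tickets': 3}]
```

The join only works because both Domains agreed on a consistent identifier, which is exactly why federated governance standardizes shared entities.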
Common semantics and conventions for shared entities
It is recommended to define a standard that all Domains follow when creating Data Products.
- VDP Naming Conventions.
- Standardized View definitions including names, descriptions, tags…
- Automated verification of development standards.
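A convention check of this kind can be automated with a simple script, for instance as part of a CI pipeline. The prefixes below (`iv_` for interface views, `bv_` for base views) are a hypothetical example convention, not a Denodo requirement; adapt the patterns to your own standard:

```python
import re

# Hypothetical naming convention: interface views are prefixed "iv_", base
# views "bv_", and names are lower snake case.
CONVENTIONS = {
    "interface": re.compile(r"^iv_[a-z][a-z0-9_]*$"),
    "base": re.compile(r"^bv_[a-z][a-z0-9_]*$"),
}

def violations(views: dict) -> list:
    """Return the view names that do not match the convention for their type."""
    return [name for name, kind in views.items()
            if not CONVENTIONS[kind].fullmatch(name)]

print(violations({"iv_sales": "interface", "Orders": "base"}))  # ['Orders']
```

Running such a check on every revision keeps all Domains aligned without manual review.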
Global security and governance policies
Global Security Policies along with Tag Management allow each Domain Data Product Owner to be accountable for the security of their data product while making sure all Data Products are consistently and reliably secure.
- Tags are used to apply Global Security Policies to Data Products.
- Domain Data Product Owners can be allowed to set security and masking policies themselves.
- Platform Product Owners can still add their own security policies on top of the ones defined by the Domain Data Product Owners.
- Global Security Policies for reusable objects that apply to all Domains can be defined in lower-level views.
- If row restrictions are applied in a federated environment, it is recommended to follow specific best practices.
- If the organization needs more centralized control of the Global Security Policies:
  - Each Domain Data Product Owner can be responsible for applying tags to their own Data Products.
  - Platform Product Owners can define the policies to be applied over the defined tags.
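The division of labor described above can be sketched in plain Python (this is an illustration of the concept, not the Denodo API): the central team maintains a registry mapping tags to policies, while each Domain only tags its own fields.

```python
# Central registry (Platform Product Owners): tag -> policy function.
POLICIES = {"pii": lambda value: "****"}  # mask fields tagged as PII

# Per-Domain tagging (Domain Data Product Owners); hypothetical fields.
FIELD_TAGS = {"email": {"pii"}, "region": set()}

def apply_policies(row: dict) -> dict:
    """Apply every policy associated with a field's tags to its value."""
    out = {}
    for field, value in row.items():
        for tag in FIELD_TAGS.get(field, set()):
            value = POLICIES[tag](value)
        out[field] = value
    return out

print(apply_policies({"email": "a@b.com", "region": "EMEA"}))
# {'email': '****', 'region': 'EMEA'}
```

Because the policy is attached to the tag rather than to individual views, a single central rule is enforced consistently across every Domain that uses the tag.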
Summary
As seen throughout this KB article, decentralized data organizations offer organizational benefits to the company, such as the independence of each Domain to create its own Data Products and share them within the company, faster creation of those Data Products, and improved data quality.
Data Virtualization can be used to implement the main decentralization principles by allowing Domain independence over the same platform, simplifying the creation and interoperability of Data Products through its integration capabilities, providing the Domains with governance and security over their data, and including a self-service platform through the Data Catalog.
References
- Dehghani, Zhamak. Data Mesh. O'Reilly Media. Kindle Edition.
- The Value of Data Virtualisation in a Data Mesh - By Intelligent Business Strategies - January 2022
The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.
For an official guide of supported features, please refer to the User Manuals. For questions on critical systems or complex environments we recommend you to contact your Denodo Customer Success Manager.