Denodo Virtual DataPort allows configuring a Cache Engine to store local copies of the data retrieved from the data sources. This can be useful for several purposes, such as enhancing performance, protecting data sources from costly queries, and/or reusing complex data combinations and transformations.
Since Denodo version 8, Denodo also includes smart query acceleration techniques using a new type of view called Summary. Summaries allow to store common intermediate results that the query optimizer can then use as a starting point to accelerate analytical queries.
At the same time, the Denodo Platform supports user and role-based authentication and authorization mechanisms with both schema-wide permissions (e.g., to access Denodo databases and views) and data-specific permissions. For the particular case of views, Denodo allows defining fine-grained execution privileges to specify what particular rows and columns should be visible to a user/role when that view is executed as well as for INSERT/UPDATE/DELETE operations.
The purpose of this document is to provide best practices on how to design the caching and/or smart query acceleration strategies considering the security requirements on the views to make sure on the one hand, these requirements are preserved and, on the other hand, queries from restricted users can still benefit from these acceleration features.
This document is aimed at people who are already familiar with the caching system as well as with the Denodo security system and the different kinds of fine-grained view privileges available: column restrictions, row restrictions and custom policies.
This document is structured on the following sections:
- Modeling in Denodo following a layered architecture. A virtual model in Denodo is usually structured in different layers that can involve different development teams with different people in charge of assigning the privileges of the views built by each team. This section describes the most common layers to understand the need of defining fine-grained privileges on certain layers.
- Fine-grained privilege limitations using cache. In the multi-layered scenario described in the previous section, the stewards of the lower layers may want to limit the indirect access other roles can have from higher layers, for example using fine-grained privileges. Caching a view from a higher layer can be incompatible with preserving the fine-grained privileges defined underneath. This section describe several strategies to deal with this situation:
- Create summaries using different versions of the dataset
- Define the restrictions on the cached view
- Create different views aimed at different roles and cache each one with the data for each role
- Modeling summaries considering fine-grained privileges. This section analyzes in more detail how to model a summary that both preserves restrictions and accelerates performance.
- Final conclusions. This section summarizes the main ideas and best practices exposed in the document.
1 Modeling in Denodo following a layered architecture
Virtual models in Denodo are usually designed following a layered architecture in order to separate the different responsibilities, increase reusability and help the security management and maintenance. The chosen layers may vary but in order to illustrate the best practices we will use the following layered structure as it is commonly used:
- A base layer or physical layer, containing the data sources’ details and base views
- An integration layer, containing the intermediate transformations and combinations needed to create the views exposed to consumers.
- A semantic layer containing the canonical entities that act as data assets for different purposes and can be transversal to different areas.
- A modeling layer containing views that different business units may need to create for specific purposes.
- An applications and reporting layer containing the final views which the different business units want to publish for specific applications, and optional pre-canned reports with calculated metrics. May also contain final views published to certain clients as an API.
In general, data consumers can access subsets of the views in the applications and modeling layer (or even in the semantic layer) depending on their privileges.
Data stewards must decide which views will be available to what data consumers and what restrictions should apply to each one: different data services, business power users, data scientists, etc. For example, let’s say a company has the following two virtual databases in Denodo:
- Core_db: This database contains the views from the semantic layer. Among others it contains the views EMPLOYEE and DEPARTMENT. The data steward for this database has assigned the following privileges in the production environment:
- Role hr_manager has EXECUTE privilege on EMPLOYEE.
- Role sales_manager has a masking restriction on the salary column so that role can only see the salary for employees in the sales department.
- HR_db: This database contains the views of the Human Resources development team. The team has built a view SALARY_DETAILS combining DEPARTMENT and EMPLOYEE. The data steward for this database has assigned EXECUTE privilege to the role sales_manager on SALARY_DETAILS.
If you specify a fine-grained restriction on a view for a specific role, like masking the salary column on EMPLOYEE, that restriction is always applied on that view even if the user with that role executes a view built on top of it (unless the user has another role with more privileges). Therefore, if a user with role sales_manager executes the view SALARY_DETAILS, Denodo will still mask the salary information coming from EMPLOYEE:
See the document “Best practices in designing fine-grained privileges in multi-layered virtual models” of the Knowledge Base for more details on how to design the fine-grained privileges on a multi-layered architecture like this one.
2 Fine-grained privilege limitations using cache
If a view is configured to use cache, the restriction enforcement may not be possible. This is because when a user loads the cache, it’s populated with the version of the data that particular user can see, and once that data is in the cache, Denodo will access the cached data regardless of the user. In the example, if a user without any fine-grained restrictions loads the cache for the view SALARY_DETAILS, a user with role sales_manager executing that view will see all data as Denodo will access the cache directly:
Best practice: When using cache in view hierarchies with fine-grained privileges, define the restrictions on those views with cache configured or on derived views on top of them, but avoid defining them on views under a cached view.
It’s important to note that the best practices in designing fine-grained privileges establish that, in general, they should be defined on the views exposed to other users. Therefore, the best practice above will be already verified organically in most cases.
There are situations however, where following this best practice might not be possible. In the previous example, views EMPLOYEE and SALARY_DETAILS belong to different development teams with different data stewards. In addition, applying caching on EMPLOYEE would not be an option in this case as it would not obtain the same performance benefit as applying it on the aggregated view on top.
For those specific cases, there are different ways to design your caching strategy preserving the user restrictions but we will focus on 3 of them: 1) using summaries for accelerating the execution without breaking the security rules set in the view’s underneath, 2) Defining the restrictions on the cached view; and 3) creating views for each role, and caching them.
2.1 Create summaries using different versions of the dataset
Instead of using traditional cache, an alternative approach available since Denodo 8.0 is to use summary views to accelerate queries. Differences between caching and summaries make them more suitable for some of the modeling layers and not others (See section ‘Summaries vs Cache’ in the document “Best Practices to Maximize Performance III: Caching”).
In contrast with the cache, Denodo will only use the summary if it can preserve the fine-grained privileges defined for the user executing a query. In section 3 we describe in detail how summaries can be a better strategy than caching when the security requirements include fine-grained privileges.
2.2 Define the restrictions on the cached view
If the cached view projects the necessary fields to define the restriction, it would be a good idea to define the restriction on this view, either duplicating the one from the sub-view or adapting the restriction considering the specific audience of this particular derived view.
In the example, we could modify the permissions of sales_manager on SALARY_DETAILS to include the same masking policy. A better approach in terms of maintenance would be to use tag-based security policies as it would avoid duplicating the rule as long as the sensible fields have the right tags in place.
2.3 Create different views aimed at different roles and cache each one with the data for each role
If none of the previous strategies fit in your scenario, you can take another approach: In order to provide different versions of the cached data to different users you can create different views in Virtual DataPort so each role has access to each copy of the view.
For example, consider in our example the company has two regional divisions: EMEA (Europe, the Middle East and Africa) and APAC (Asia-Pacific), and we have two different roles in HR for each region: hr_emea and hr_apac. The data steward for the semantic layer has assigned a row restriction to hr_emea and hr_apac on the view EMPLOYEE so these roles can only see the information from EMEA employees or APAC employees respectively.
Let’s say it is not possible to define the same privilege at the top view level because the region column is not available. A different approach would be to create two different identical views either by creating new views SALARY_DETAILS_APAC and SALARY_DETAILS_EMEA in the same virtual database or by creating different virtual databases hr_emea and hr_apac and create the same view SALARY_DETAILS on each one.
Then you can load the cache for each one using the right user for each load:
When hr_emea stores the data in the cache for SALARY_DETAILS_EMEA, as that role can only see employees from the EMEA region, the data stored in cache will contain the information from that region only. The same happens when the role hr_apac loads the cache for the view SALARY_DETAILS_APAC.
In the specific case where different roles have access to different virtual databases you just need to create the same view on each database and load the cache following the same procedure:
3 Modeling summaries considering fine-grained privileges
Denodo 8.0 includes a feature called Smart Query Acceleration, which dynamically selects pre-stored data to avoid performing some of the same data combinations. It consists of: 1) pre-computing interesting data aggregates called “summaries,” so then 2) the query optimizer can automatically decide to use these summaries in future queries.
For example, let’s say a user sends a query to obtain the average salary of the company employees. If there is a summary already created containing the average salary by department, the query optimizer will detect that it can use the data in that summary to get the final result by applying a final aggregation.
As summaries are transparent to the user and more flexible than traditional caching, they allow to keep the original view for all roles and use one summary or another depending on the user executing the query.
Imagine we create a new summary sum_max_salary as:
SELECT deptno, max(salary) FROM SALARY_DETAILS GROUP BY 1
In this case, if an admin user executes SELECT max(salary) as max_salary FROM SALARY_DETAILS, the query optimizer will detect it can use the data stored in the summary to answer that query.
On the other hand, if a user with role hr_emea executes the same queries, the query optimizer will detect that it cannot use the summary as the query for that user requires an extra condition that is not included in the pre-stored data (see image below).
This means in this case the role hr_emea will not take advantage of the summary sum_max_salary.
In order to allow role hr_emea to take advantage of the summary acceleration, recreate the summary including the necessary fields to evaluate the restrictions.
For example, we could create a new summary sum_max_salary_region including the maximum salary by department and region:
SELECT region, deptno, max(salary) FROM SALARY_DETAILS GROUP BY 1,2
Once the summary is created and loaded, if a user with role hr_emea executes the previous query, the query optimizer will detect there is a summary available with a partial result that can use and apply the restriction and the final aggregation on top:
Notice role hr_apac would also benefit from the same summary.
This strategy has the advantage that it is completely transparent to the user as the view on your model is the same one in all cases. Denodo will take into account if the query used to load the summary includes the required user restrictions and therefore the user will never access data that is not supposed to see. In case there is no summary that matches the required restrictions it will access the original data instead of using the pre-stored one.
Using summaries has the advantage that the optimizer will not select a summary that does not fulfill the view’s restrictions. At the same time, it is important to make the right design decisions to make sure the summaries you select are actually used for the restricted users.
In this regard, it is a good idea to define summaries using the same views containing the fine-grained privileges, which will also allow to include in the summary those columns that can be necessary to apply a restriction over the summary like in the previous example.
3.1 Best practices designing summaries on a Logical Data Warehouse
In analytical scenarios, it is common in Denodo to model a Logical Data Warehouse architecture. This offers data consumers the appearance of a traditional data warehouse, but where the data can be distributed among different systems. This is an important use case where summaries fit particularly well and it deserves some extra consideration.
In a Logical Data Warehouse architecture, there is usually a facts table and several dimensions following a star-schema configuration. In a traditional star schema scenario, in most cases the fine-grained privileges will be placed over the dimensions. This means that even though there can exist several roles with different fine-grained privileges, that does not imply you need to create a different summary for each one. Instead, we could create a summary over the facts table and that summary could be used for multiple queries with different restriction needs.
For example, let’s say we are a retailer company and we create a view CURRENT_STORE_SALES combining views SALES, ITEM and EMPLOYEE obtaining some sales information for the sales of the current year:
We want to cache that derived view and there exists several roles with different restrictions over columns on ITEM and EMPLOYEE. We could create the following summary:
If now a user executes a query over CURRENT_STORE_SALES the optimizer will rewrite the query as follows:
It will use the data cached in the summary and then complete the columns from the dimensions tables applying the necessary restrictions on each one.
4 Final conclusions
When your model requires fine-grained privileges in combination with either caching or smart query acceleration using summaries you need to take into account:
- Summaries offer more advantages in general, as more queries will potentially benefit from them. In addition, even if they are not configured taking the privileges into account the worst-case scenario will always preserve the security restrictions. In order to use summaries combined with fine-grained privileges there are two main best practices to follow:
- Define the restrictions and the summaries on the same views, those top views that are accessible to other users.
- Include in the summary definition those columns that are necessary to evaluate a restriction, or create different summaries aimed at different roles, including the specific restrictions on each one.
- Caching has additional configuration capabilities and it does not depend on the optimizer decisions, which means it will always access the cached data without any overhead from the optimization process. This makes caching especially attractive for final reports or base views requiring incremental mode. In order to use caching combined with fine-grained privileges:
- Define the restrictions on those views with cache configured or on derived views on top of them, but avoid defining them on views under a cached view.