Enrichment with AI

Data Enrichment

Data enrichment is a fundamental step in AI data projects that significantly impacts the quality and relevance of the agent's responses. RAG systems operate by retrieving relevant external data before generating a response, and the effectiveness of this process is directly tied to the quality of the information they can access. Sometimes the LLM will come up with the right query on its own but, often enough, pre-classifying and pre-enriching the data will boost the chances of getting the right insights. By enriching data, you go beyond simple keyword-based searches and provide the AI with additional context and meaning, which is crucial for answering complex queries.

For instance, consider an AI agent trying to answer why customer satisfaction is declining. Without enrichment, the agent might only see raw sales figures and customer names. With enrichment, unstructured text from customer reviews in an online shop is analyzed by AI functions and tagged with structured metadata, such as sentiment scores (“negative”, “neutral”, “positive” in a new customer_sentiment enriched field, for instance) or key tags representing sentiment themes (for example, "slow delivery" or "poor product quality"). This enriched data provides a more comprehensive picture. The AI can then retrieve it to generate a grounded, insightful response, like "Customer satisfaction is declining due to negative sentiment in online reviews, particularly concerning slow delivery." This process moves the AI from simply retrieving data to providing valuable, actionable insights.
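As an illustrative sketch, such an enrichment could be implemented as a derived view that applies an AI classification function to the review text. The view and field names below (review, review_text, review_enriched) are hypothetical, and the CLASSIFY_AI signature is assumed from the examples later in this document; check your Denodo version for the exact syntax:

```sql
-- Hypothetical VQL sketch: tag each review with a sentiment label
-- so downstream queries can filter on customer_sentiment directly.
CREATE OR REPLACE VIEW review_enriched AS
SELECT
    review_id,
    review_text,
    CLASSIFY_AI(review_text, 'negative, neutral, positive') AS customer_sentiment
FROM review;
```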

Ultimately, data enrichment is vital for improving the accuracy and reliability of RAG systems. It helps prevent the AI from "hallucinating" or generating inaccurate information, and it surfaces insights that might not come directly out of a traditional one-shot query, because the AI is grounded in verifiable and current facts from your own systems, obtained in smaller steps. When an application's knowledge is enriched, it can handle ambiguous queries, and its responses become more consistent and transparent, providing users with a clearer understanding of the source. This not only enhances the user experience, but also builds trust and credibility.

Metadata Enrichment

Metadata enrichment is equally critical, because it provides the AI agent with a “map” to navigate the data effectively. While data enrichment focuses on enhancing the content within a record (for instance, classifying a review as positive or negative), metadata enrichment is about providing descriptive context about the data sources themselves. This includes detailing what each table and field represents, their relationships, and the purpose of complex queries. An AI application, when faced with a user asking, "What does the field customer_since mean?", cannot just guess. It needs to retrieve the precise definition from the metadata, such as "Customer Online Registration Confirmation Date", to provide an accurate response.

By documenting the model's design, including field descriptions, data owners, and business objectives, you equip the AI agent with the knowledge not only to answer questions about the data itself, but also to understand the broader context of the business. For example, if a user asks for all product reviews with negative comments, the AI can consult the metadata to understand that "customer reviews" and "online shop data" are distinct data artifacts. It can also identify the correct field names to filter by "negative sentiment". This level of contextual awareness, powered by rich metadata, is what allows an application to perform complex tasks, ensuring its responses are not just factual, but also relevant to the user's specific query and the underlying business logic. It also gives nuance to the schema representation of the data. For example, you might add a description to a view that warns about the uniqueness of the records, or that states constraints beyond what the technical model describes, allowing the LLM to refine the queries it launches to find the right answers.

Data Model Documentation

Exposure/Selection of the Current Data Model

When exposing your data model, it is possible to provide access to the data "as-is," allowing users to explore the full breadth of available information. While this approach offers maximum flexibility, it is often wise to carefully select a subset of the data for exposure. By curating the data that is made available, you minimize the risk of users misinterpreting or misusing information.

A simple approach is to split the model in two: an internal database containing all the artifacts necessary to answer questions and provide insights, and an external database containing the selection of elements to synchronize with, and expose to, the AI agents.

This strategic exposure is especially critical in environments where the AI agent is intended for a specific purpose, or where sensitive data exists. For example, rather than exposing the entire customers_inventory view, which contains detailed legal names, contracts, and internal codes, you might create a simpler derived view in the database exposed externally to the AI, one that only exposes account_name and customer_code. This ensures the AI agent and the end user only have access to the necessary information, reducing the risk of revealing sensitive details or of a user building an incorrect query on a complex data structure. This promotes a safer, more efficient, and more reliable user experience.

How to approach it with Denodo

Denodo's capabilities extend to creating new virtual databases that can house a curated selection of derived views, based on a more complete “internal” Denodo database, offering a robust way to manage data exposure. This approach allows you to build a new, "AI-facing" database without altering the original data model. Within this new database, you can define derived views that select specific columns and apply filters or aggregations to existing views. This is an effective way to control which data elements are accessible to an AI agent or other users, minimizing the risk of information misuse. For example, you can create a derived view in the new database that only includes the account_name and customer_code from the original customers_inventory view, ensuring sensitive details like customer_legal_name are not exposed.
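As a minimal sketch, the derived view in the AI-facing database could look like the following. The view name customers_basic and the qualified name internal_db.customers_inventory are placeholders, and the exact cross-database syntax may vary with your Denodo version:

```sql
-- Sketch: a reduced projection of customers_inventory, created in the
-- AI-facing database so that sensitive columns such as
-- customer_legal_name are never part of the exposed schema.
CREATE OR REPLACE VIEW customers_basic AS
SELECT
    account_name,
    customer_code
FROM internal_db.customers_inventory;
```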

Furthermore, Denodo's Interface Views can be used as a "contract" to formalize and control data access. An interface view acts as a predefined structure that derived views must adhere to. By creating an interface view with a carefully chosen set of columns and data types, you establish a standard for all views intended for public consumption. This ensures that any derived view built for the new database, such as one exposing a subset of account information, will conform to the interface view's schema. This "contract" minimizes the chance of exposing unauthorized data and simplifies the management of data governance, as you can easily audit which views are compliant with the established access rules.
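A sketch of such a contract in VQL follows. Both i_customers_basic and customers_basic are hypothetical names, and the CREATE INTERFACE VIEW form shown here is simplified, so check your Denodo version's VQL reference for the exact syntax:

```sql
-- Sketch: an interface view that fixes the schema ("contract") for the
-- exposed account information, bound to a hypothetical derived view
-- that implements it. Any compliant implementation must match these fields.
CREATE OR REPLACE INTERFACE VIEW i_customers_basic (
    account_name : text,
    customer_code : text
)
SET IMPLEMENTATION customers_basic;
```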

Anonymization

Anonymization of personal information is a critical security practice, particularly when exposing data models to AI agents or broader audiences. Sensitive data, such as account names or customer personal information, can be anonymized through a variety of techniques, hashing and masking among them.

Hashing involves replacing the original data with a fixed-length string of characters, making it computationally difficult to reverse the process and retrieve the original value.

Masking, on the other hand, involves partially or completely obscuring the data, for example, by replacing most characters with asterisks (*) and only showing the last few characters, such as ****6**9.
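As a sketch, both techniques can be expressed as expressions in a derived view. HASH is the Denodo function mentioned in this document; the masking expression below is an assumption built from generic string functions, and the exact SUBSTR/LEN semantics may need adjusting to your Denodo version:

```sql
-- Sketch: hash the email, and mask the phone number so that only the
-- last four characters remain visible.
SELECT
    HASH(email) AS email_hashed,
    CONCAT('******', SUBSTR(phone_number, LEN(phone_number) - 4)) AS phone_number_masked
FROM customer;
```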

By implementing these anonymization techniques, you can ensure that the AI agent can still perform its intended functions without accessing personally identifiable information. For instance, an application might not need to know a customer's full name to answer a query. Instead, it can use the hashed or masked version of the account name to track and analyze trends across a group of customers, such as identifying if the "likelihood to revisit" is high for a specific set of anonymized accounts. This approach allows for valuable insights to be extracted from the data while strictly adhering to privacy and data protection regulations.

Aggregation

Aggregating data is an effective method for protecting individual privacy while still allowing for valuable analysis. By combining data from multiple individuals into summary statistics, you remove access to specific personal details. This process moves the focus from "what is a single customer doing?" to "what is the trend for a group of customers?" For example, instead of tracking each customer's individual "likelihood to revisit" score, a company can aggregate these scores by region or account type. The resulting data might show that the average "likelihood to revisit" is higher in North America than in Europe, without revealing the names or specific scores of any single person. This method allows businesses to gain market insights and identify broad trends without compromising the privacy of their customers.

This approach is particularly useful in scenarios where detailed individual data is not necessary for a business decision. By aggregating the data, you reduce the risk associated with handling personally identifiable information (PII) and ensure compliance with data protection regulations. The summarized information, such as the total number of customers who visited in a given quarter or the average spend per region, still provides a basis for strategic planning and business intelligence. This balance between data utility and privacy protection makes aggregation a fundamental technique.
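The shift from individual to group-level data can be sketched as a simple aggregation (customer_scores, region and likelihood_to_revisit are hypothetical names used only for this illustration):

```sql
-- Sketch: expose only per-region summary statistics, so no single
-- customer's score or identity is visible in the result.
SELECT
    region,
    COUNT(*) AS total_customers,
    AVG(likelihood_to_revisit) AS avg_likelihood_to_revisit
FROM customer_scores
GROUP BY region;
```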

Example of Data Model

This section provides a detailed breakdown of a sample data model (virtual database samples_bank, as part of the Denodo AI SDK installation), including a catalog of all virtual views and their fields. It also contains a section per view giving suggestions for enrichment.

Databases

For this document, we are referring to two different Databases: the internal DB samples_bank, and the external DB samples_bank_ai.

The internal database contains all the necessary data elements; a curated selection of these is exposed in the second database, which is the one presented to the AI in our example.

Base Views for internal DB samples_bank

View: account

  • acct_id: int (The unique identifier of the account)
  • customer_id: int (The identifier of the customer)
  • acct_type: text (The type of account)
  • balance: double (The balance, this is the current amount of money in the account)
  • date_created: localdate (The account creation date)

View: customer

  • customer_id: text (The unique identifier of the customer)
  • first_name: text (The first name of the customer)
  • last_name: text (The last name of the customer)
  • email: text (The email of the customer)
  • phone_number: text (The phone number of the customer)
  • address: text (The address of the customer)
  • city: text (The city where the customer lives)
  • state: text (The state where the customer lives)
  • zip_code: text (The zip code where the customer lives)

View: loan

  • loan_id: int (A unique identifier for each loan)
  • customer_id: int (A unique identifier for each customer)
  • loan_amount: double (The total amount of money that has been loaned)
  • interest_rate: double (The interest rate applied to the loan)
  • term: int (The duration of the loan)
  • property_id: int (A reference to the property associated with the loan, if applicable)
  • loan_officer_id: int (A reference to the loan officer who managed the loan application)
  • status: text (The current status of the loan. Can be rejected, pending or approved)
  • date_created: localdate (The date the loan was originated)

View: loan_officer

  • loan_officer_id: int (A unique identifier for each loan officer)
  • first_name: text (The first name of the loan officer)
  • last_name: text (The last name of the loan officer)
  • email: text (The email of the loan officer)
  • phone_number: text (The phone number of the loan officer)

View: payment

  • payment_id: int (A unique identifier for each payment transaction)
  • loan_id: int (A reference to the loan for which the payment is made)
  • payment_amount: double (The amount of money paid in the transaction)
  • payment_date: localdate (The date on which the payment was made)

View: property

  • property_id: int (A unique identifier for each property)
  • address: text (The street address of the property)
  • city: text (The city in which the property is located)
  • state: text (The state in which the property is located)
  • zip_code: text (The postal code for the property’s location)
  • property_value: double (The assessed value of the property in dollars)

View: rate

  • rate_id: int (A unique identifier for each interest rate entry)
  • loan_type: text (The type of loan to which the interest rate applies)
  • term: int (The term length of the loan in years)
  • interest_rate: double (The interest rate for the loan, expressed as a percentage)

View: underwriting

  • underwriting_id: int (The unique identifier of the underwriting record)
  • loan_id: int (The loan identifier; relates to loan)
  • credit_score: int (The credit score of the borrower involved in the underwriting process)
  • employment_history: text (The employment history of the borrower; example records include "Stable" and "Varied")
  • financial_history: text (The financial history of the borrower; sample records are "Good" and "Fair")

Associations: this model also defines several associations that describe the relationships between these views.

  • customer_accounts: links customer to account on customer_id.
  • customer_loans: links customer to loan on customer_id.
  • loan_payments: links loan to payment on loan_id.
  • loan_underwritings: links loan to underwriting on loan_id.
  • loanofficer_loans: links loan_officer to loan on loan_officer_id.
  • property_loans: links property to loan on property_id.

The views customer and property have a 'role_filtered' tag, which suggests that row-level security or access restrictions may be applied to them based on user roles.

Derived Views

This set of views constructs the base layer for data enrichment by combining the previous views. We will illustrate three scenarios:

Hashing and reducing the amount of Personally Identifiable Information

View: iv_customer

  • customer_id: text (The unique identifier of the customer)
  • first_name_hashed: text (The first name of the customer - hashed)
  • last_name_hashed: text (The last name of the customer - hashed)
  • email_hashed: text (The email of the customer - hashed)
  • phone_number_hashed: text (The phone number of the customer - hashed)
  • address_hashed: text (The address of the customer - hashed)
  • city: text (The city where the customer lives)
  • state: text (The state where the customer lives)
  • zip_code: text (The zip code where the customer lives)

Considerations: This is a derived view that provides a de-identified version of the customer data. The fields first_name, last_name, email, phone_number and address have been processed with a HASH function, which means the original, plain-text values are not directly available. This is typically done for privacy or security purposes. Alternatively, these fields could simply be removed; whether that is viable depends on the availability of other fields (like non-identifiable IDs). Keeping the hashed values still allows you to do groupings and processing without knowing the identities of the individuals.
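A sketch of how iv_customer could be defined over the base customer view (the exact HASH function signature may vary by Denodo version):

```sql
-- Sketch: identifying fields pass through HASH; geographic fields
-- (city, state, zip_code) stay in plain text to allow grouping.
CREATE OR REPLACE VIEW iv_customer AS
SELECT
    customer_id,
    HASH(first_name)   AS first_name_hashed,
    HASH(last_name)    AS last_name_hashed,
    HASH(email)        AS email_hashed,
    HASH(phone_number) AS phone_number_hashed,
    HASH(address)      AS address_hashed,
    city,
    state,
    zip_code
FROM customer;
```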

Grouping things together

View: iv_loan_statistics

  • state: text (The state of the property associated with the loan)
  • zip_code: text (The zip code of the property associated with the loan)
  • date_created_year: int (The year the loan was created)
  • date_created_quarter: int (The quarter of the year the loan was created)
  • total_loans: long (The total number of loans in the grouping)
  • total_customers: long (The total number of unique customers in the grouping)
  • total_loan_amount: double (The total sum of all loan amounts)
  • avg_loan_amount: double (The average of all loan amounts)
  • total_property_value: double (The total sum of all property values associated with the loans)
  • avg_property_value: double (The average of all property values associated with the loans)
  • avg_interest_rate: double (The average interest rate for the loans)
  • avg_term: double (The average term of the loans in years)
  • approved_loans: long (The count of loans with an 'approved' status)
  • rejected_loans: long (The count of loans with a 'rejected' status)
  • pending_loans: long (The count of loans with a 'pending' status)

Considerations: This is a derived view that provides a statistical summary of loan data, grouped by geographical location and time. It combines information from the loan and property views. The total_loans and loan status counts are of type long because they are the result of an aggregation (COUNT and SUM of a constant) and could exceed the maximum value for an int. The other aggregated fields like avg_loan_amount and total_loan_amount are double to accommodate fractional values. This view is ideal for analytical purposes as it provides pre-aggregated metrics, reducing the need for repeated complex queries on the raw data.

In some cases, the AI might be able to figure out such queries without pre-created views. In other cases, where processing such a query takes significant time or the relations between the objects have nuances, it is easier to pre-fabricate these summary artifacts to simplify the processing.
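A sketch of the aggregation behind iv_loan_statistics, joining loan and property and grouping by location and creation period. Only a subset of the fields is shown, and GETYEAR and GETQUARTER are assumed date functions, so adapt it to your Denodo version:

```sql
-- Sketch: status counts use SUM over a CASE expression, which is why
-- they come out as long rather than int.
CREATE OR REPLACE VIEW iv_loan_statistics AS
SELECT
    p.state,
    p.zip_code,
    GETYEAR(l.date_created)       AS date_created_year,
    GETQUARTER(l.date_created)    AS date_created_quarter,
    COUNT(*)                      AS total_loans,
    COUNT(DISTINCT l.customer_id) AS total_customers,
    SUM(l.loan_amount)            AS total_loan_amount,
    AVG(l.loan_amount)            AS avg_loan_amount,
    SUM(CASE WHEN l.status = 'approved' THEN 1 ELSE 0 END) AS approved_loans,
    SUM(CASE WHEN l.status = 'rejected' THEN 1 ELSE 0 END) AS rejected_loans,
    SUM(CASE WHEN l.status = 'pending'  THEN 1 ELSE 0 END) AS pending_loans
FROM loan l
JOIN property p ON l.property_id = p.property_id
GROUP BY p.state, p.zip_code, GETYEAR(l.date_created), GETQUARTER(l.date_created);
```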

Doing AI enrichment based on existing data

In some cases, we might want to get the assistance of an LLM to enrich certain data based on text: sentiment analysis, classification and tagging, etc.

In our example, we can obtain some statistical information for each state that helps to enrich our overall statistics: we will ask for the average price of a property in the same state. While this is just a basic example, the same principle can be applied to build a realistic use case.

View: iv_property

  • property_id: int (A unique identifier for each property)
  • address: text (The street address of the property)
  • city: text (The city in which the property is located)
  • state: text (The state in which the property is located)
  • zip_code: text (The postal code for the property’s location)
  • property_value: double (The assessed value of the property in dollars)
  • average_property_price_per_state: text (A value generated by the function ENRICH_AI(), representing the average property price per state based on general knowledge)
  • property_type: text (A classification with the function CLASSIFY_AI() of the property's location, generated by an AI model as either 'Urban', 'Suburban', 'Rural', or 'Commercial')

Considerations: This is a derived view that adds a new field, average_property_price_per_state, which is generated with the enrich_ai function. This is a key distinction from the other fields, as the value is not derived from the underlying dataset but is an estimate provided by an external AI model based on the state field.

The property_type field is a classification generated by the classify_ai function. This AI-generated category is based on a combination of the address, zip_code, and state fields. It categorizes the property into one of the predefined types, which is useful for analysis but not a fact from the original data source.

In this case, just as an illustrative example, the fields have been generated by invoking the following code as part of the derived view.
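A sketch of what that invocation could look like (the exact ENRICH_AI and CLASSIFY_AI signatures depend on your Denodo version and LLM configuration, so treat this as an approximation rather than the literal code):

```sql
-- Sketch: AI-generated fields added on top of the base property view.
-- ENRICH_AI produces an estimate from general knowledge; CLASSIFY_AI
-- picks one of the predefined categories from the location fields.
CREATE OR REPLACE VIEW iv_property AS
SELECT
    property_id,
    address,
    city,
    state,
    zip_code,
    property_value,
    ENRICH_AI(CONCAT('Average property price in the US state of ', state))
        AS average_property_price_per_state,
    CLASSIFY_AI(CONCAT(address, ', ', zip_code, ', ', state),
                'Urban, Suburban, Rural, Commercial')
        AS property_type
FROM property;
```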

One important aspect to keep in mind is that the LLM will be queried once per row by those functions. If the amount of data is significant, it will take a long time to obtain a full response. In such a case, it might be desirable to combine the Denodo Cache and the Denodo Scheduler cache load jobs to perform the enrichment in chunks and make it seamless for the users.

Using Interfaces

Denodo offers the possibility of creating Interface Views that will help decouple the implementation of the data model. In our case, we could establish an interface view for every view that we will expose externally. If the implementation changes later, we will be able to change that without affecting the whole data model schema.

As implementation views, we will use the obtained derived views:

  • iv_customer
  • iv_property
  • iv_loan_statistics

Plus the base views:

  • account
  • loan
  • loan_officer
  • payment
  • rate
  • underwriting

For each of these, we will create an interface view in our internal DB (from one of them, we select “New Interface”, and in the implementation we drag and drop the view we are building it from).

Views for External DB samples_bank_ai

We will create a virtual database that will be synchronized with the AI (Denodo AI SDK or other vendors).

We will select the views from the interface views created before. In these views, we can still tweak and refine metadata definitions beyond what is already available.

It is advisable to keep the model implementation in the internal database, to avoid exposing that construction to the AI.

Considerations on Security and Performance

Security

Enrichment functions and Denodo Assistant

While many security considerations relate to the exposure of the data and metadata as a whole, it is necessary to pay attention to the explicit AI enrichment functions too, since they can easily expose data “under the radar”. Since these functions can be used with a general connection to the LLM, developers with the necessary privileges need to be careful about which data fields they expose.

It can be a good practice to tag such views and synchronize them in the Data Marketplace so users can identify whether some information is AI-enriched or not.

The developer must be granted the use_large_language_model role. This is a specific role designed to control access to all of the LLM functions, including enrich_ai. The user also needs the necessary privileges (like CONNECT and EXECUTE) on the database and views they are querying.

One possibility could be to link such a role with a particular Global Security Policy that masks/hides certain fields tagged with a particular Tag relating to personal information. This might prevent undesired information from being used by the roles with access to such functionality.

The Denodo Assistant configuration allows the exchange of sample data whenever there is an exchange with the LLM. This affects the areas where Denodo can help review code in Design Studio or automatically compose a query in the Data Marketplace. This exchange of sample data is not used by the AI functions, which only send the field content used in the call to the function.

Exposure to AI (Denodo AI SDK or other vendors)

As discussed previously, exposure of personal information is one of the major risks in exchanging information with AIs; the techniques covered before will therefore prevent personal information from being exchanged with the AIs.

On top of this, there is the question of who has access to what when an AI system interacts with Denodo. In this case, the principles of Access Control in Denodo will ensure a solid security governance. See the Denodo Security Overview for more information.

One of the main principles is the use of Roles, which enable or disable users' access to data products in the Denodo Data Marketplace and the Denodo Platform.

Performance

If the amount of enrichment is large, its construction needs to be considered with care.

Assuming that a single row of a view can take a few seconds to obtain, it will be necessary to estimate the total time required to enrich a whole dataset and, based on this, arrange the generation of the enrichment in time chunks, for instance with the help of the Denodo Scheduler and the Denodo Cache.

It is also possible to parallelize many queries towards the LLM simultaneously, as far as the Denodo Platform is capable of managing them. See “Maximum number of concurrent iterations” in the Denodo Scheduler documentation, for example.

With this in mind, it is possible to plan the necessary amount of processing within a time window, together with the degree of parallelization required.

Disclaimer

The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.

For an official guide of supported features, please refer to the User Manuals. For questions on critical systems or complex environments we recommend you to contact your Denodo Customer Success Manager.