Overview

The Denodo AI SDK Evaluator is a tool that evaluates the performance and correctness of queries generated by the Denodo AI SDK, providing actionable, automated feedback.

This document provides a guide to the evaluation features, output, and usage. It walks through how the tool works, how to interpret its results, and how to get it up and running in your own environment. It is designed for developers and data engineers working with the Denodo AI SDK who need to validate and optimize their AI SDK deployments in the following development scenarios:

  • Determine whether changes to your model, prompt, or context improved accuracy.
  • Identify the root causes of data quality issues.
  • Compare different versions of LLMs.
  • Verify that version upgrades of the AI SDK did not cause regressions.

How It Works: The Evaluator Pipeline

The evaluation process is orchestrated by combined_eval.py and involves several distinct stages that run in sequence.

  1. VQL Generation to Generate a QA Dataset
    The process begins with the ai_sdk_utils.py script, which reads an input Excel file containing your questions and ground truth VQL. For each question, it calls the Denodo AI SDK's /answerQuestion API endpoint in a multithreaded manner to generate the VQL query responses under test. It also collects the associated timing metadata from the AI SDK response, creating an initial DataFrame that serves as the input for the next stages.
  2. F1 Eval
    Next, the f1_eval.py script evaluates the accuracy of the data returned by the generated VQL. It executes both the generated and ground truth queries against your Denodo Data Catalog instance and compares the result sets to calculate the F1 score, Percent Overlap, and Subsetting Percentage.
  3. VES Eval
    The ves_eval.py script then assesses execution efficiency. First, it verifies that a generated query produces results identical to its ground truth counterpart. If the results match, it executes both queries multiple times to obtain a stable average execution time; the ratio of these times is used to calculate the VES. If the results do not match, the VES score is automatically 0.
  4. Combined Reporting in Excel
    Finally, the combined_eval.py script merges all the metrics collected in the previous stages (AI SDK metadata, F1 scores, and VES scores) into a single, comprehensive DataFrame. It then uses this DataFrame to generate the final Excel report with its Summary and Details sheets and native visualizations.
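To illustrate the first stage, the sketch below shows one way the multithreaded /answerQuestion calls could be structured. Note that generate_vql and the ask wrapper are hypothetical names for illustration only, not the actual ai_sdk_utils.py API:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_vql(questions, ask, max_workers=4):
    """Submit every question to the AI SDK concurrently.

    `ask` is a callable(question) -> dict holding the generated VQL and
    timing metadata, e.g. a thin wrapper around POST /answerQuestion.
    pool.map preserves the input order, so results line up with `questions`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ask, questions))

# A hypothetical wrapper around the real endpoint might look like
# (API_URL, USER, and PASSWORD are placeholders):
#
# def ask(question):
#     r = requests.post(f"{API_URL}/answerQuestion",
#                       json={"question": question},
#                       auth=(USER, PASSWORD), timeout=120)
#     return r.json()
```

Because the calls are independent HTTP requests, a thread pool is enough to overlap the network latency without any shared state between workers.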

Understanding the Output

The Denodo AI SDK Evaluator analyzes the questions submitted in a dataset, focusing on both the correctness of the results and the efficiency of their execution.

After each execution the Eval tool produces a comprehensive Excel file with two distinct sheets: Summary and Details. This report is designed to give you both a high-level overview and a granular, query-by-query analysis.

  • Summary Sheet
    This sheet offers an aggregated overview of the assessment, grouped by difficulty level. It displays key metrics, accompanied by native Excel charts for easy visualization and post-processing.
      - Percent Subset: The average percentage of matching values between rows at the same index.
      - Percent Correct Queries: The percentage of generated queries whose result sets perfectly match the ground truth.
      - Number of Samples: The total count of queries in each difficulty group.
      - Mean Time (s): The average total execution time from the AI SDK.
      - Time Std Dev: The standard deviation of AI SDK execution times.
  • Details Sheet
    This sheet contains the individual evaluation results for each question/query pair, providing a deep dive into every assessment.
      - Question ID & Difficulty: Identifiers for each query.
      - VQL Generated & Ground Truth VQL: The full VQL for both the AI-generated query and the reference query.
      - Results Match: 1 if the generated results exactly match the ground truth, 0 otherwise.
      - Subsetting Percentage: The percentage of generated results found in the ground truth, where the order of rows matters.
      - Percent Overlap: The percentage of common elements between the two result sets, where order does not matter.
      - Have Same Row Count: 1 if the row counts match, 0 otherwise.
      - Bird Standard F1: An F1 score based on cell-wise matches, derived from the BIRD benchmark methodology.
      - VES Score: The Valid Efficiency Score, indicating execution efficiency relative to the ground truth.
      - AI SDK Timing Metrics: Includes sql_execution_time, vector_store_search_time, llm_time, and total_execution_time.

Interpreting Key Metrics

To quickly assess performance, focus on these key metrics:

  • Percent Correct Queries (Summary): This is your primary, strict measure of accuracy. It tells you what percentage of queries are perfect.

Percent Correct is a good metric for single-tuple ground truths. For more complex questions, this binary metric may not capture performance on queries that return a larger table of data.

  • Percent Subset (Summary & Details): This indicates how well the content of generated results aligns with the ground truth on a row-by-row basis. High values suggest the AI is retrieving relevant data accurately, even if the structure is not perfect.

Example:

Ground truth rows: [(1, 2, 3), (4, 5, 6)]
Predicted rows:    [(1, 2, 7), (4, 8, 6)]

Row 1: GT=(1,2,3) vs Pred=(1,2,7)
  - Matching values: {1, 2} = 2 elements
  - Score: 2/3 = 0.667
Row 2: GT=(4,5,6) vs Pred=(4,8,6)
  - Matching values: {4, 6} = 2 elements
  - Score: 2/3 = 0.667

Aggregated: (0.667 + 0.667) / 2 rows = 0.667

Unlike Percent Correct, Percent Subset is not sensitive to column order, making it useful for large result sets with multiple columns. This matters because LLMs can over- or under-specify columns, for example by using SELECT * instead of naming specific columns. LLMs can also reorder columns: if the ground truth is (first_name, last_name) but the SDK returns (last_name, first_name) with the same values, this metric still shows a strong match because it compares each row as a set of items, not as values at exact column positions.
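The worked example above can be reproduced in a few lines of Python. This is an illustrative sketch of the row-wise subset logic, not the exact f1_eval.py implementation; percent_subset is a hypothetical name:

```python
def percent_subset(gt_rows, pred_rows):
    """Average, over row pairs at the same index, of the fraction of
    ground-truth values found in the predicted row (position-insensitive)."""
    scores = []
    for gt, pred in zip(gt_rows, pred_rows):
        matched = len(set(gt) & set(pred))  # shared values, ignoring column position
        scores.append(matched / len(gt))
    return sum(scores) / len(scores) if scores else 0.0

gt = [(1, 2, 3), (4, 5, 6)]
pred = [(1, 2, 7), (4, 8, 6)]
print(round(percent_subset(gt, pred), 3))  # → 0.667
```

Because each row is compared as a set, swapping two columns in the predicted result leaves the score unchanged, which is exactly the tolerance described above.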

  • VES Score (Details): For queries that are 100% correct, this score tells you how efficient they are. A higher score is better, indicating performance on par with or better than the ground truth.

Ground truth result: [('John', 30), ('Alice', 28), ('Bob', 35)]
Predicted result:    [('John', 30), ('Alice', 28), ('Bob', 35)]

Result: EXACT MATCH

Each query is then executed multiple times:

  • Iteration 1: predicted 0.15 s, ground truth 0.12 s, ratio 0.12/0.15 = 0.80
  • Iteration 2: predicted 0.14 s, ground truth 0.13 s, ratio 0.13/0.14 = 0.93
  • Iteration 3: predicted 0.16 s, ground truth 0.11 s, ratio 0.11/0.16 = 0.69

Average time ratio = (0.80 + 0.93 + 0.69) / 3 = 0.81

Reward scale:

  • ratio ≥ 2.0:        reward = 1.25
  • 1.0 ≤ ratio < 2.0:  reward = 1.0
  • 0.5 ≤ ratio < 1.0:  reward = 0.75
  • 0.25 ≤ ratio < 0.5: reward = 0.5
  • ratio < 0.25:       reward = 0.25

Final reward = 0.75

Unlike the other metrics, this one diagnoses query performance rather than correctness. The VES Score is useful for ensuring the model is not generating technically correct but highly inefficient SQL.
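The reward scale and averaging step above can be sketched as follows. Here ves_reward is a hypothetical helper for illustration; the actual ves_eval.py additionally handles query execution and timing, and scores mismatching result sets as 0:

```python
def ves_reward(avg_ratio, results_match=True):
    """Map an average ground-truth/predicted time ratio to a VES reward.
    A mismatching result set scores 0 regardless of speed."""
    if not results_match:
        return 0.0
    if avg_ratio >= 2.0:
        return 1.25
    if avg_ratio >= 1.0:
        return 1.0
    if avg_ratio >= 0.5:
        return 0.75
    if avg_ratio >= 0.25:
        return 0.5
    return 0.25

# The three iterations from the worked example above:
ratios = [0.12 / 0.15, 0.13 / 0.14, 0.11 / 0.16]
avg = sum(ratios) / len(ratios)  # ≈ 0.81
print(ves_reward(avg))  # → 0.75
```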

  • Bird Standard F1 (Details): This offers a more forgiving, cell-level correctness score, which is useful for understanding partial matches.

Ground truth rows: [(1, 2, 3), (4, 5, 6)]
Predicted rows:    [(1, 2, 7), (4, 8, 6)]

Row 1: GT=(1,2,3) vs Pred=(1,2,7)
  - Match: {1, 2} = 2 elements
  - Match score: 2/3 = 0.667
  - Pred-only: 1/3 = 0.333
  - Truth-only: 1/3 = 0.333
Row 2: GT=(4,5,6) vs Pred=(4,8,6)
  - Match: {4, 6} = 2 elements
  - Match score: 2/3 = 0.667
  - Pred-only: 1/3 = 0.333
  - Truth-only: 1/3 = 0.333

Aggregated:
  - TP (true positives) = 0.667 + 0.667 = 1.334
  - FP (false positives) = 0.333 + 0.333 = 0.666
  - FN (false negatives) = 0.333 + 0.333 = 0.666
  - Precision = 1.334 / (1.334 + 0.666) = 0.667
  - Recall = 1.334 / (1.334 + 0.666) = 0.667
  - Bird F1 = 2 * 0.667 * 0.667 / (0.667 + 0.667) = 0.667

F1-score is a more lenient metric that reduces the impact of column order and missing values in the tables produced by predicted VQL queries. For more information, see Soft F1-Score.
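The aggregation above translates directly into code. Again, this is a sketch of the BIRD-style soft F1 with hypothetical names, not the shipped implementation:

```python
def bird_f1(gt_rows, pred_rows):
    """Soft F1 over row pairs: each row contributes fractional TP/FP/FN
    based on how many cell values it shares with its counterpart."""
    tp = fp = fn = 0.0
    for gt, pred in zip(gt_rows, pred_rows):
        matched = len(set(gt) & set(pred))
        tp += matched / len(gt)                  # matched fraction of the row
        fp += (len(pred) - matched) / len(pred)  # predicted-only cells
        fn += (len(gt) - matched) / len(gt)      # truth-only cells
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gt = [(1, 2, 3), (4, 5, 6)]
pred = [(1, 2, 7), (4, 8, 6)]
print(round(bird_f1(gt, pred), 3))  # → 0.667
```

A perfect match yields 1.0, a fully disjoint result yields 0.0, and partial cell overlap lands in between, which is what makes this metric useful for diagnosing near misses.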

Getting Started

Before you begin, ensure your environment meets the necessary requirements. The following steps will guide you through the installation and configuration process.

Prerequisites

  • Python 3.10+
  • Denodo Platform 9 or higher, with an accessible Data Catalog.
  • Denodo AI SDK deployed and accessible via its API.

Installation and Configuration

  1. Navigate to the project root directory:

cd path/to/aisdk-eval-main

  2. Create and activate a Python virtual environment:

# Create the virtual environment
uv venv

# Activate the environment
# On Windows:
.venv\Scripts\activate

# On Linux/macOS:
source .venv/bin/activate

  3. Install the required dependencies:

uv pip install -r requirements.txt

  4. Configure your Data Marketplace connection:
    Modify the file named project_config.env inside the eval/ directory and add your Denodo Data Marketplace connection details.

DATA_CATALOG_URL=http://your-denodo-server:9090/denodo-data-catalog
DATA_CATALOG_EXECUTION_URL=http://your-denodo-server:9090/denodo-data-catalog/public/api/askaquestion/execute
DATA_CATALOG_SERVER_ID=1
DATA_CATALOG_VERIFY_SSL=0
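For reference, a minimal way to read such a KEY=VALUE file from Python, assuming simple unquoted values like the snippet above (the actual scripts may use a dedicated dotenv loader instead; parse_env_lines is a hypothetical helper):

```python
def parse_env_lines(lines):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    config = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        config[key.strip()] = value.strip()
    return config

# Usage against the real file:
# with open("project_config.env") as fh:
#     cfg = parse_env_lines(fh)
```

Splitting on the first '=' only is important here, since the URL values themselves may contain '=' characters in query strings.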

Running an Eval

Run an eval using combined_eval.py, which orchestrates the entire pipeline.

  1. Prepare your input file. You will need an Excel file (e.g., sample_input.xlsx) with columns for your questions, ground truth VQL, and difficulty levels. The default column names are "Question", "Solution", and "difficulty"; you can also optionally add additional context.
  2. Navigate to the eval directory:

cd path/to/aisdk-eval-main/eval

  3. Run the evaluation script. The following command runs the full pipeline, taking your input file and generating a complete report.

Windows

python combined_eval.py ^
  --input "../sample_input.xlsx" ^
  --output "../results/evaluation_results.xlsx" ^
  --api-url "http://your-aisdk-server:8008/answerDataQuestion" ^
  --user "your_api_user" ^
  --password "your_api_password" ^
  --question-rows 5

Linux

python combined_eval.py \
  --input "../sample_input.xlsx" \
  --output "../results/evaluation_results.xlsx" \
  --api-url "http://your-aisdk-server:8008/answerDataQuestion" \
  --user "your_api_user" \
  --password "your_api_password" \
  --question-rows 5

This will generate an Excel file containing the evaluation results. For a full list of command-line parameters, refer to the script's help menu with:

python combined_eval.py --help

Conclusion

In summary, the Denodo AI SDK Evaluator provides a framework for validating AI-generated queries. By offering detailed metrics on correctness, overlap, and efficiency, development teams can quickly pinpoint weaknesses, optimize AI behavior through better prompting, and ultimately deploy more reliable and performant solutions. This tool is an essential part of the development lifecycle for any project leveraging the Denodo AI SDK, ensuring that the final product meets the highest standards of quality and accuracy.

References

Denodo AI SDK Evaluator 

Disclaimer

The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.

For an official guide to supported features, please refer to the User Manuals. For questions about critical systems or complex environments, we recommend contacting your Denodo Customer Success Manager.