Semantic Search Using Embeddings¶
Semantic search enables you to query data based on the meaning and context of the content rather than simple keyword matching. This method is particularly effective for analyzing unstructured data, such as customer reviews, support tickets, or product descriptions.
In Virtual DataPort, you perform semantic search using vector data types and functions that calculate the distance between vectors. This approach allows you to measure the similarity and distance between two vectors..
Understanding Vector Embeddings¶
To perform semantic search, the Denodo Platform leverages data that has been transformed into vectors.
What are Vectors? In the context of Generative AI and semantic search, a vector (or “embedding”) is an array of numbers (e.g.,
[0.12, -0.98, 0.45, ...]). These numbers represent the semantic meaning of a piece of text in a high-dimensional space.Generating Vectors Vectors are generated by processing raw text through an Embedding Model (e.g., models from OpenAI, Google, or Amazon).
External System: Typically, large tables of vectors are generated outside Denodo and stored in a vector-capable database.
Denodo: You can generate vectors dynamically using the Denodo function EMBED_AI().
Measuring Similarity (Vector Distance) To identify similar meanings, the system calculates the mathematical distance between two vectors. In Denodo, this is handled by the VECTOR_DISTANCE() function.
High Similarity = Low Distance: A small distance between vectors indicates the text segments are semantically similar.
Low Similarity = High Distance: A large distance indicates the meanings are unrelated.
Warning
Consistency is Critical: You cannot compare vectors generated by different models. To perform a valid semantic search, the query text must be embedded using the exact same model and version used to generate the stored vectors in your database.
Data Storage Expectations The table containing the vector data is created and populated in a specialized repository or a relational database with vector capabilities outside of Denodo. Denodo then introspects this table.
Virtual DataPort can access vector data stored in sources such as:
Snowflake
Databricks
Google BigQuery
PostgreSQL (via the
pgvectorextension)Oracle Database 26ai
Microsoft SQL Server 2025
Configuration¶
To enable seamless semantic search execution within the Denodo Platform, specific configurations are required at both the Server and View levels.
Server Configuration¶
You must configure an embedding model in the Server configuration. This defines the connection to the LLM or embedding service that the function EMBED_AI() uses during execution.
When a query requires converting a string into a vector on-the-fly, the Virtual DataPort (VDP) server uses this configuration to call the external API.
Base View Configuration¶
When you introspect a table containing vector data (e.g., a PostgreSQL table with a vector column), you should configure the following in the Design Studio:
Source Type Property: Explicitly set which model was used to populate the column.
Column Description: Provide a meaningful description that details the information used to generate the vector (e.g., “Embedding generated from the customer review summary”).
To specify the Embedding model source type property of a vector column:
Open the Base View and go to the Edit tab.
Navigate to the column that holds the vector data.
Set the Source Type Property “Embedding Model”.
Save the changes.
Benefits of this Configuration
Automated Vector Generation: It enables the VDP server to automatically identify the required model during query execution. When you run a query comparing text to this column, VDP identifies the model needed to vectorize your input text, eliminating the need to manually specify it in every query.
Governance and Documentation: It serves as a metadata record, documenting exactly which model and version were used to generate the vectors (e.g.,
openai-text-embedding-3-smallvscohere-embed-english-v3.0). This prevents version mismatches and ensures developers know how to interact with the data.AI-Powered Features: The column descriptions allow the the Denodo Assistant for VQL Shell, the Denodo AI SDK, and the Denodo Assistant Chatbot tools to generate more accurate queries by understanding the semantic context of the vector data.
Examples¶
The following examples demonstrate how to perform semantic search on a view named support_tickets.
Scenario Setup
These examples rely on the following configuration:
View Structure: The
support_ticketsview contains unstructured data regarding IT issues, including:ticket_descriptionprioritydescription_vector: The pre-calculated embedding of theticket_description.
Metadata Configuration: The Embedding model source property of the
description_vectorcolumn is set toall-minilm:l6-v2. This informs the Denodo Platform which model generated the stored vectors.Server Configuration: The Server configuration is connected to an external model provider and configured to use the same model,
all-minilm:l6-v2.Execution: The queries use the function VECTOR_DISTANCE() to calculate the semantic similarity between the user’s query and the stored tickets.
Note: A lower distance score indicates higher similarity.
Base view with a vector column and the embedding model in the column source properties.¶
Basic Semantic Search (Ranking)¶
The following query searches for tickets related to “internet connectivity issues.”
SELECT
ticket_id,
ticket_description,
vector_distance(description_vector, 'My laptop keeps losing connection to the Wi-Fi') AS distance_score
FROM
support_tickets
ORDER BY
distance_score
LIMIT 3;
Results:
Notice that the system returns results about “Wireless” and “Network” even though the word “Wi-Fi” might not be present in every row.
ticket_id |
ticket_description |
distance_score |
|---|---|---|
501 |
Wireless signal drops intermittently when in the conference room. |
0.1245 |
302 |
Cannot connect to the office network; adapter shows no internet. |
0.1588 |
889 |
VPN disconnects frequently during video calls. |
0.2105 |
Hybrid Search (Semantic + Structured Filter)¶
Hybrid search combines semantic understanding with traditional structured filtering. This is useful when you need to find semantically relevant results that are strictly limited to a specific category, date range, or status.
In this example, we search for issues semantically similar to “login failure,” but we strictly filter to show only tickets marked as “Critical” priority.
SELECT
ticket_id,
ticket_description,
priority,
vector_distance(description_vector, 'Users cannot log into the portal') AS distance_score
FROM
support_tickets
WHERE
priority = 'Critical'
AND vector_distance(description_vector, 'Users cannot log into the portal') < 0.4
ORDER BY
distance_score;
Results:
The results exclude common password resets (Low priority) and focus only on critical outages, even if the phrasing varies (e.g., “SSO”, “Auth”).
ticket_id |
ticket_description |
priority |
distance_score |
|---|---|---|---|
112 |
Global portal outage: 503 Service Unavailable. |
Critical |
0.0512 |
205 |
SSO service is down; authentication tokens are being rejected. |
Critical |
0.0890 |
Recommendation System (“More Like This”)¶
You can use semantic search to build a recommendation engine or a “find similar” feature. Instead of comparing a user’s text input to the database, you compare the vector of an existing record to the vectors of all other records using a distance function such as VECTOR_COSINE_DISTANCE().
In this example, we identify a specific ticket (ID 1042) regarding a complex server crash. We then search for other tickets in the system that describe a similar problem to identify if this is a recurring pattern.
SELECT
t.ticket_id,
t.ticket_description,
vector_cosine_distance(t.description_vector, source.description_vector) AS distance_score
FROM
support_tickets t,
(SELECT description_vector FROM support_tickets WHERE ticket_id = 1042) source
WHERE
t.ticket_id <> 1042 -- Exclude the source ticket itself
AND vector_cosine_distance(t.description_vector, source.description_vector) < 0.2
ORDER BY
distance_score;
Results:
Assuming Ticket 1042 was “Server crash due to memory leak,” the query returns other tickets describing memory issues.
ticket_id |
ticket_description |
distance_score |
|---|---|---|
1045 |
Production application server threw OutOfMemoryError. |
0.0341 |
998 |
High RAM usage detected on cluster node 3 causing freeze. |
0.1120 |
Note
The threshold values (e.g., 0.2) and distance scores depend on the distance metric used (Cosine, Euclidean, Manhattan, or Inner Product) and the specific embedding model.
