MongoDB Sources¶
This section explains how to create MongoDB data sources and later, create base views over them. You will query these base views in order to retrieve data from these data sources. It covers the following topics:
Creating a MongoDB Data Source¶
To create a MongoDB data source, click the menu File > New > Data source and then, click MongoDB.
Enter this information:
“Connection” tab
Name. Name of the new data source.
Connected by. Choose the way to specify the connection information, either by host and port (Connection parameters) or with a complete URI (MongoDB URI).
To connect to a MongoDB Atlas cluster it is necessary to use the MongoDB URI option, this documentation indicates how to obtain the complete URI: Find Your MongoDB Atlas Connection String. Here is an example:
Connecting to a MongoDB Atlas cluster:
mongodb+srv://username:password@cluster0.example.mongodb.net/?retryWrites=true&w=majority&appName=cluster0
Connecting to a self-hosted MongoDB standalone deployment:
mongodb://username:password@denodo.com:27017/defaultauthdb
In both examples, the credentials
username:password
can be omitted in the URI and configured in the Authentication section explained below.Use TLS. Select the type of TLS to use in the connection with the database. See more about the
Mutual (two-way TLS)
in the section below Mutual Authentication.Authentication. Select the authentication method and enter the name of the database associated with the user credentials in the Database field (this is equivalent to connection option authSource of MongoDB).
The authentication options are:
No authentication.
Login and password.
Kerberos with login and password. The data source will connect to the database using Kerberos authentication, with the login and password you enter.
Kerberos with keytab. The data source will connect to the database using Kerberos with the login and keytab file you enter.
Use Pass-through session credentials. If selected, when a client application queries a view that uses this data source, the data source will use the credentials of the user to connect to this database. The behavior changes depending on the authentication method used by the client application to connect to Virtual DataPort:
If the client connects using Kerberos authentication, the Server will request a Kerberos ticket to connect to the database on behalf of this user. Note that for this option to work, the database must also support Kerberos authentication.
If the client connects using login and password, the Server will use this login and password to connect to the database.
If the database only supports Kerberos authentication, select Requires Kerberos so the Server uses the login and password of the client to obtain a Kerberos ticket to connect to the database.
The feature “pass-through session credentials” is only for queries. You have to provide the credentials that the Server will use to connect to the database during the introspection process (i.e. to display the collections of the database in the “Create base view” wizard of the data source). That is why, when you select this authentication method, the panel Use other type of authentication to create base view is displayed. In this panel you have to select a different authentication method to connect to the database during the introspection process (i.e. to display the tables/views of the database in the “Create base view” wizard of the data source).
When the data source is created with this option, the Server creates a pool of connections for each new user that runs a query involving this data source. Initially, these pools only have one connection to prevent the creation of many connections. The maximum number of connections for each one of these pools is the value of the field “Maximum number of active connections” of the connection pool. See information in the section Connection Pool configuration.
Warning
Be careful when enabling the cache on views that involve data sources with pass-through credentials enabled. The appendix Considerations When Configuring Data Sources with Pass-Through Credentials explains the issues that may arise.
Credentials vault. The data source will obtain the credentials to connect to the database from the Credentials Vault that the administrator of Virtual DataPort chose in Server Configuration. It will use the authentication method you selected (login/password or Kerberos). There are two types of configurations for these authentications:
Single secret. Select this when the Credentials Vault of your organization stores the login and password in single secret (i.e. CyberArk or AWS Secrets Manager). If you select this, enter the Account name that holds the credentials.
One secret per field. When the credentials vault used stores each credential at a different secret (i.e. Azure Key Vault). If you select this, configure each field indicating if its value comes from the credentials vault and put its secret name or value depending on the case.
From vault. Indicates if the field is retrieved from the credentials vault. If you select this, enter the secret name that this field has in the credentials vault.
Click Data Source credentials vault to override the configuration set at server level. In the popup, clear the check box Default configuration and enter the settings for this data source.
Mutual Authentication¶
To achieve a higher level of security, consider enabling mutual authentication to the connections with MongoDB by enabling Mutual Authentication. With this type of authentication (also known as “two-way SSL/TLS”), the client application that connects to MongoDB (in this case Virtual DataPort) has to present a certificate to validate its identity.
To enable this, select Mutual (two-way TLS) and provide this:
Certificate password: password of the file that contains the private key that the data source will use.
Click Load certificate and select the file that contains the private key. The supported formats are PKCS#12 and JKS.
Connection Properties¶
To specify connection options, click Connection properties. The documentation of MongoDB - Connection String Options lists the options you can add here.
Alternatively, you can add them to the connection URL but adding them here is more convenient.
If you specify a property both in the URL of the data source and in the Connection properties, the value applied to the connection will be the one in the Connection properties dialog (the value in the URL will be ignored). This also applies to the authentication method; the data source will use the authentication method selected in Authentication, not the one in the URL.
Connection Pool Configuration¶
To optimize the connections to the databases, the data source creates a pool of connections to the database (one pool per MongoDB data source). The benefit of having this pool is that, instead of opening a new connection every time it has to execute a query in MongoDB, the connection is already opened. This improves the response time of the queries.
To configure the settings of this pool, click Connections Pool Configuration:
Maximum number of active connections: maximum number of active connections to this database that the pool will open (default value: 100).
Minimum number of active connections: minimum number of active connections to this database that the pool will open (default value: 0).
Connection timeout (milliseconds): connection timeout in milliseconds (default value: 10,000).
Socket timeout (milliseconds): time in milliseconds to attempt a send or receive on a socket before the attempt times out (default value: 0).
Wait queue timeout (milliseconds): maximum time in milliseconds that a thread can wait for a connection to become available (default value: 120000).
Metadata Tab¶
In the Metadata tab, you can:
Set the folder where the data source will be stored.
Provide a description.
Set the infrastructure information where the data source is deployed. You can set the name of: - The infrastructure provider (e.g. on-premise, Amazon Web Services, Microsoft Azure, Google Cloud Platform, etc.) - The region name: you can type a custom name if the provider name or region is not on the list.
After you enter all the parameters on this dialog, click Test connection to check that the data source is able to access the configured source.
Creating Base Views from a MongoDB Data Source¶
Once you create the data source, create a base view that represents a collection in your MongoDB database.
To create a MongoDB base view follow these steps:
Open the MongoDB data source and click Create base view.
The wizard displays a tree with the databases of the source. Expand the databases to inspect its collections.
The wizard loads the names of the collections upon request, when you expand a database. That is because retrieving the names of all the collections of all the databases at once could be a very time-consuming task. Therefore, when you enter the name of a collection in the search box, the wizard only searches in the databases you already expanded.
Select the the collection that you want to create a base view from. Since MongoDB is a dynamic storage system and documents in a collection can have different schemas, VDP will analyze these documents and create a common schema that includes all fields from the analyzed documents.
In Limit documents to analyze you specify the maximum number of documents to be analyzed in a collection, to extract its schema. This limit prevents the analysis of all the documents in a collection and therefore avoid time-consuming tasks.
Enter 0 to disable this limit.
Select Initial aggregation pipeline to enter a JSON pipeline that will be applied to the documents in the selected collection.
The fields resulting from this pipeline will constitute the schema of the base view. The section below MongoDB Aggregation Pipelines provides more details about this.
Click Browse to select the folder where the base view(s) will be created. In this dialog, you can create new folders or rename the existing ones (right-click on this dialog to display these two options).
Click Create selected. The wizard will open a new tab with the new base view where the following actions can be performed:
Change the name of the new base view.
Change the name and type of the new base view’s attributes.
Edit the “Source type properties” of the field by clicking the button . In this dialog, you can define the exact type of the field and depending on the type, its length and number of decimals.
See more about these properties in the section Viewing the Schema of a Base View.
Remove attributes from the view. Several fields can be selected and deleted from the view’s schema by clicking on the Remove button. Interpolation variables (if the pipeline contains any) cannot be removed as they are necessary for the correct execution of the view. The Restore fields button will undo the deletion and restore the original fields.
Note
Actions Remove and Restore fields can only be executed before saving the view. Fields removed from the view will not be retrieved when executing a Source Refresh (even if you use a pipeline that projects those fields). It is necessary to recreate the view to recover the deleted fields.
Change the primary key definition of the view.
See section Primary Keys of Views for more information about primary keys of views.
In the Metadata tab, set the folder where the base view will be stored and provide a description.
When editing the data source, you can also change its owner by clicking .
MongoDB Aggregation Pipelines¶
When creating a base view over a MongoDB collection, you can specify an aggregation pipeline to filter the documents of that collection. The result of the pipeline will be the schema of the base view.
To use an Aggregation Pipeline, select Initial aggregation pipeline and enter it in the text area that is displayed.
An aggregation pipeline consists of one or more states that process the documents (see more information in about their syntax in Aggregation Pipeline). A pipeline is a JSON array and each element of the array is a state that produces a result. Some examples are shown below:
Projection
SQL
SELECT iinc_id, summary, taxid FROM internet_inc
Aggregation pipeline
[{ "$project": { "iinc_id": 1, "summary": 1, "taxid": 1, } }]
Selection
SQL
SELECT iinc_id, summary, taxid FROM internet_inc WHERE iinc_id = 1
Aggregation pipeline
[ { "$project": { "iinc_id": 1, "summary": 1, "taxid": 1, }, }, { "$match": { "iinc_id": 1 } } ]
Interpolation Variables¶
The pipeline may have interpolation variables (see section Paths and Other Values with Interpolation Variables), which allow the Aggregation Pipeline sent to the database to be parameterized according to the specified query conditions.
If the pipeline has interpolation variables, after clicking “Create Selected”, enter the values of the variables used in the pipeline. With these values, Virtual DataPort will execute the pipeline and obtain the metadata required to create the base view.
Each interpolation variable must be related with an attribute belonging to the generated base view, so that the variable can obtain its value at run time. If the Aggregation Pipeline used to access the data source returns any field with the same name as the variable, then the variable will be associated to the corresponding base view attribute. If the Aggregation Pipeline does not return any field with the same name as the variable, Virtual DataPort will add a new attribute to the base view, with the same name. This attributes cannot be removed from the view.
In your pipeline, search for the following characters and escape them with \
: @
and ^
.
That is, replace them with \@
, and \^
. These characters have to be escaped to avoid they are
considered part of an interpolation expression.
Delegation Capabilities and Limitations¶
Virtual DataPort is able to delegate to the source some operations such as
ORDER BY
, GROUP BY
, numeric functions, window functions, etc. Since MongoDB is a NoSQL and
non-relational storage system, the behavior of some functions is different from JDBC;
therefore, the obtained results may also differ. This section explains some behaviors
to consider regarding MongoDB functions:
When sorting by a field in MongoDB,
null
values are returned at the beginning, unlike JDBC which returns them at the end.The
sum
function convertsnull
values to 0. For example, in JDBC when summing an entire row ofnull
values the result will benull
, however in MongoDB the result is “0”.Window functions:
Range windows require all values of the
ORDER BY
to be numbers, if there is other data type or anull
value it returns an error.Range windows can only sort by one field and the sort must be ascending.
Functions
RANK
andDENSE_RANK
only allow to sort by one field.The function
STDEV
returnsnull
if the sample consists of a single numeric value.As previously mentioned
null
values in MongoDB are returned first. The result of window functions that depend on the order of the tuples to build the result (such asLEAD
andLAG
) could be different than JDBC.
The complete list of delegation capabilities is shown below:
Clauses |
SELECT, DISTINCT, WHERE, ORDER BY, GROUP BY, HAVING, FETCH LIMIT, OFFSET |
Operators |
=, !=, <, <=, >, >=, IN, EXISTS, LIKE, ISNULL, ISNOTNULL |
Aggregation Functions |
AVG, COUNT, FIRST, LAST, MAX, MIN, STDEV, STDEVP, SUM |
Analytic Functions |
AVG, COUNT, DENSE_RANK, FIRST_VALUE, LAG, LAST_VALUE, LEAD, MAX, MIN, RANK, ROW_NUMBER, SUM, STDEV |
Datetime Functions |
GETDAY, GETHOUR, GETMILLISECOND, GETMINUTE, GETMONTH, GETSECOND, GETWEEK, GETYEAR |
Numeric Functions |
ABS, ACOS, ASIN, ATAN, ATAN2, CEIL, COS, DEGREES, DIV, EXP, FLOOR, LN, LOG, MAX, MIN, MOD, MULT, POWER, RADIANS RAND, ROUND, SIN, SQRT, SUBTRACT, SUM, TAN, TRUNC |
Text Functions |
CONCAT, INSTR, LEN, MAX, MIN |
Other Functions |
COALESCE |