A group of Denodo servers sharing the same metadata can be clustered to offer High Availability (HA), scalability and improved performance. In essence, a Load Balancer presents a “virtual server” address to the outside world. When users attempt to connect, it forwards the connection to an available server, taking care of bi-directional network address translation in doing so.
Generally speaking, a basic load-balancing transaction works as follows:
The following elements are important to consider in determining how best to configure optimal performance and evenly-distributed load in a Denodo environment:
Figure 1: Denodo Cluster Setup
In a typical configuration like the one seen above, the load balancer contains a list of participating member nodes. Since a node is defined as a combination of IP and port numbers, clusters for JDBC (RMI Registry port 9999) as well as for web service access (HTTP port 9090) are defined. The client applications are now pointing to the load balancer address, using it as a “virtual server”. When connections are created, the load balancer routes them to a specific member of the cluster following whatever balancing strategy has been configured on it.
In the following section, we describe configuration and other considerations to keep in mind during the design and operation of a cluster of Denodo servers.
Denodo has been used in production clusters both with hardware and software load balancers, such as:
The configuration of a Denodo cluster in the load balancer should be a straightforward process as shown in Figure 1 above.
The list of nodes for a virtual server in the load balancer is configured based on server and port. Since Denodo uses different ports for RMI, ODBC and ADO.NET, and HTTP, we will need to configure three virtual servers. The default ports for the Denodo platform are:
Some recommendations when using the HAProxy/ELB:
With Denodo 8.0, firewall configuration for the JDBC interface is made simple. The JDBC (along with the administration tool and the new Design Studio) connections now uses only one port to communicate with Virtual DataPort Server (by default, 9999). In previous versions, the communication requires two ports (9999 and 9997). This change simplifies the configuration of firewalls.
Please note: ODBC, ADO.NET and HTTP connections do not use this port and therefore this consideration does not apply if those are the only access methods in use.
Although Denodo Servers are stateless (the result of a transaction will not depend on previous executions) the "server affinity" needs to be configured in the balancing process because of how the RMI protocol works. See Virtual DataPort Connectivity for more information.
Round Robin strategies tend to be the most effective in scenarios involving queries of similar “weight” (measured as a combination of response data per query to the sources and post-processing in Denodo), and where the nodes are equally powerful.
In mixed scenarios where “heavy” informational queries coexist with “lighter” operational queries, or when the technical specs of the nodes are different, round robin can lead to node overloads (see FAQ for more details). More complex load information can be gathered from the nodes and used for advanced status analyses (memory usage, CPU load, etc.). Load balancers usually have strategies that are able to gather and use this information to provide more intelligent balancing decisions.
For example, BigIP F5 offers a strategy called “Dynamic Ratio” to cover those scenarios (see appendix for details on how to configure this balancing strategy).
Denodo provides a health check script that checks a server to ensure it is capable of responding to queries. It can also execute a sample query and advertise the number of results returned, if necessary.
Most load balancers can use external scripts to check the status of member nodes (BigIP F5 refers to these as “External Monitors”). If this capability is not available, a ping to a node’s port can be used as a substitute. This option, although faster, provides a less reliable health check than the Denodo script.
For more information, refer to the section Connecting from a JDBC Client Through a Load Balancer and Using the Ping Script fn the Virtual DataPort Administration Guide.
Some load balancers replace the origin IP address of the client applications in the requests sent to the Virtual DataPort Server, in those cases, the property userAgent can be used to retrieve the IP of the original client application.
Client connections to the Denodo Server are normally managed by a connection pool. Initializing a new connection in a pool is costly, and in many cases unnecessary. With connection pooling, each newly-created connection is placed in the pool and can be reused, reducing the amount of new connections that will need to be generated. However, the use of connection pools with inadequate balancing strategies, in certain scenarios, can lead to uneven load distribution. You can find a more detailed discussion in the FAQ “Should I use servers with different specs as nodes of my cluster?”.
Another important aspect of connection pools is their ability to check the health of a connection with a ping query. In the context of a cluster, if a ping query fails, the pool considers the connection unusable and tries to obtain another one. Since different connections point to different servers from the same pool, this mechanism can always expose healthy connections to clients even when one or more servers go down.
There are a few connection properties that have to be set in the driver to take advantage of these clustering capabilities. These properties are fully documented in the Virtual DataPort Developer Guide, section “Access through JDBC”.
For JDBC connections, set the property “reuseRegistrySocket” to “false”. RMI-based connections will be more evenly distributed across the nodes of the cluster with this configuration.
Some connection pools do not support ping query timeouts (which determines how long to wait for a response before a connection is invalidated). The Denodo JDBC driver provides a “pingQueryTimeOut” property for these connection pools.
For example: if the URI of the JDBC driver has the properties
pingQuery=SELECT * FROM dual()&pingQueryTimeOut=500
then every time the pool executes the ping query, the driver will only wait 0.5 seconds for a response (pingQueryTimeOut is in milliseconds). Otherwise, the driver throws a SQLException because the TimeOut has been reached, and the pool will consider the connection invalid.
In the context of a Denodo cluster, the Virtual DataPort server cache can be configured in two different ways:
In selecting the appropriate architecture for the cache system of a cluster, considerations such as location of the cluster nodes and cache performance become relevant. For example, if you have nodes in China and the United States, you may decide to have two shared caches, one in each location, instead of moving results over a network for large distances.
For more information on how to set up the cache, refer to the Virtual DataPort Administration Guide, section “Configuring the Cache”.
In most deployments, a single, shared cache server is typically recommended:
However, there are some considerations to bear in mind:
When multiple Denodo server nodes are configured, only one of them should be configured for performing the necessary cache maintenance tasks (such as deleting expired data from the database). Having all nodes managing these tasks is redundant and will create an unnecessary performance impact on the cache database.
Configure the “Maintenance Period” setting with an adequate value in one of the Denodo servers and disable cache maintenance for every other server in the Denodo cluster by setting “Maintenance” to “Off” in the “General Settings” tab of the cache configuration.
In clustered environments where Denodo is running in multiple physical locations, a shared cache can introduce increased cache hit latency. Locally-deployed cache databases can provide faster local access.
To enable this architecture, simply use different cache servers in each node’s configuration.
When deploying independent Denodo cache databases as described above, it may be a good idea to replicate cached data from a designated “master node”. This simplifies the cache preloading process and reduces execution times if the original data sources are local to one of the nodes but remote to others.
When a use case or scenario involves only controlled cache pre-loads, the best option is to rely on vendor-specific database replication techniques. The cache loading process can be scheduled on the “master node”, and the data will automatically propagate to the other nodes of the cluster. Oracle, MS SQLServer and MySQL all include automatic replication tools that can be set up for this purpose.
When queries to the different Denodo nodes access distinct, non-overlapping data sets, it may make more sense to use a “Partial” type cache. Since each node controls its own insertions automatically, no cache replication should be required in this case.
In order to keep Denodo Cluster nodes in sync, Denodo metadata must be replicated as well.
Starting from Denodo 8.0 the metadata can be hosted on a separate RDBMS. Storing the Catalog on an External Database helps for an effective cluster setup. If the Virtual DataPort servers of a cluster share the same metadata (hosted in an external database), you only need to perform a change in one of the servers and it will be propagated to the others, automatically. Otherwise, you have to do the same change on each server of the cluster.
If you intended not to have a shared metadata for the clusters then like in earlier version, Denodo 8.0 includes the Solution Manager that provides support for metadata replication in different nodes of a cluster. The Clusters-Configuration and Promotions section of the Solution Manager Administration Guide provides more detailed information about this functionality for creating and managing clusters.
The Denodo Platform, starting from version 6.0, includes the Diagnostic and Monitoring Tool. This tool allows monitoring the current state of a Virtual DataPort server or an environment (group of servers). The Diagnostic & Monitoring Tool displays the status of the Virtual DataPort servers using graphs and tables that are updated every few seconds. More information about the monitoring capabilities of this tool can be found in the Monitoring section of the Diagnostic & Monitoring Tool Guide.
The Denodo Server exposes several internal events using JMX, an industry standard that can be integrated with any cluster monitoring software. Denodo has been successfully monitored from commercial and open source tools such as:
The information Denodo exposes includes:
Denodo’s out-of-the-box capabilities enable well-balanced clustering using commercial load balancers. Denodo clusters can be set up using almost any load balancer on the market, including popular hardware balancing solutions such as Big IP’s F5.
Connections to data services are carried out by HTTP requests issued to Denodo-published web services. The web services will manage their own connection pools connecting to a Denodo server node and issue the corresponding queries in an efficient way. In the context of a cluster, the most common configuration has each Denodo exposing data services with its own embedded web services container. The client will request a web service using the load balancer’s DNS and the load balancer will route each HTTP request (always stateless in this context) to a node in the cluster. The responding web service will then use its own connection pool to the co-located Denodo server running on the same node to execute the request.
Requests issued via JDBC or ODBC connections will be forwarded through the load balancer which will route the connection directly to one node.
If the client is using a connection pool, a connection can be used for other queries until the connection expires. While the connection remains alive, it will point to the node initially assigned to it by the load balancer at initialization time. Since each connection is balanced at initialization time, it will ensure an even distribution of the connections between the different nodes, and an even distribution of the queries executed using that pool.
Denodo recommends managing client application connections in a client connection pool. Apart from the performance benefits, client-side connection pools offer the capability of issuing a health check (ping query) before reusing a previously opened connection. If the ping query fails, the connection pool will choose a different, healthy connection (new or reused) pointing to a different node in the cluster. This ensures that every query will be directed to a healthy server.
When a server is removed from the cluster, two scenarios can result:
No, Denodo recommends all the nodes of the cluster to be servers of similar specs. Heterogeneous deployments may lead to poorly balanced situations, especially when round robin load balancing is used.
To illustrate the problems that may arise if those assumptions are violated, consider a cluster with 2 nodes of different capacity; Node A is faster than Node B. Consider the case of a client application with no connection pool that executes 10 concurrent queries on the cluster:
It is easy to see how this process can overload the slower node while wasting the processing power of the faster nodes. The workload is not distributed in the most efficient way with Nodes of different capacities and a round-robin balancing strategy.
There are 2 recommended solutions to the scenario of nodes with different capacities depending on whether we are using a connection pool or not:
If the workload of a cluster is very heterogeneous, it could lead to similar situations to the one described in the previous question, except that the queries themselves would take a different amount of time rather than the nodes having different processing speeds. For example if some queries take 100ms to get executed while others last up to 10 minutes, you may inadvertently send all of the 10 minute queries to 1 node and the 100ms queries to another. One would then be overutilized and the other underutilized.
In general, Denodo recommends having homogeneous clusters and homogeneous queries. If there are two different use cases or applications with very different query workloads, Denodo recommends setting up different clusters for each set so that the workload is more evenly distributed. This also allows more precise set ups tweaked to the specifics of each scenario; memory heap space, for example.
However, if the recommendation is not possible, the same considerations for nodes of different capacity around balancing strategies stated in the question above will apply here.