Configuring the Memory Usage and Swapping Policy¶
Virtual DataPort provides three methods to avoid memory overflows when dealing with huge datasets:
Stop retrieving data from a source when the size of these data that have not been processed, exceeds a certain limit (“Maximum size in memory of intermediate results (MB)”).
The swapping mechanism, which stores in secondary storage the intermediate results of the execution of a query, if their size exceeds a certain limit. It also swaps the intermediate results of the sorting operations (queries with an
ORDER BY
involved). This is configured with “Swapping status” and “Maximum size in memory of each node (MB)”.Limit the maximum memory used by any query (“Maximum query size (MB)”).
To change these settings, click Server configuration on the menu Administration and then, click Memory Usage (see Memory Usage configuration).
By default, all the databases and their views use the memory usage and swapping configuration set in this dialog. However, a database or a view may have its own settings.
Read the section Configuring and Deleting Databases to learn how to change these settings for a specific database.
Note that the settings of a database override the Server ones and the settings of a specific view override the database ones.
Memory Usage Settings¶
When a user queries a view, Virtual DataPort retrieves the necessary data from the sources. In certain scenarios, the Server is able to stop obtaining data from these sources for a while and resume later. For example:
One of the sources may provide the required data faster than the other. E.g., you are querying a join view that involves two JDBC data sources. One of them connects to a database that runs on a very powerful machine and the other connects to another database that runs on a slow machine.
The query involves a join view with a complex join condition that takes more time to evaluate that what it takes to retrieve data.
In these scenarios, when the size of the data from a source that have not been processed yet, exceeds the value of the field Maximum size in memory of intermediate results (megabytes), the Server will stop retrieving more data from this source. After the Server has processed some of these data so the size of the data stored in memory decreases, the Server will resume obtaining data.
Swapping Settings¶
Click Swapping status on to enable the swapping engine for all the views that have the option “Swapping status Default” selected in its “Advanced” dialog. You can also change the default value of the following parameters:
Maximum size in memory of each node (megabytes). During the execution of a query, the Server swaps to disk the intermediate results produced during the execution of non-streaming operations, if these results exceed the maximum size specified in this parameter.
Maximum size of the blocks written to disk (Kbytes). Whenever the Server swaps to disk intermediate results, this is the maximum size of each read/write operation to disk.
Maximum disk space that can be used for swapping (MB). This is the maximum disk space in megabytes that can be used in swapping processes.
Maximum number of swap files that can be opened at the same time. This limit the number of swap files that the execution engine can open for reading or writing at the same time. It is useful to avoid running out of file descriptors on some OS when it is not possible to increment the hard limit imposed by the system.
The changes in these settings will become effective immediately.
Virtual DataPort swaps data to disk in the following scenarios:
Order by: In views/queries that perform an ORDER BY that is not delegated to the source. In this scenario, the Server has to wait until it receives all the rows of the result set to begin ordering them. If the size of the data received exceeds the Maximum size in memory of each node, the Server begins swapping to disk the data that has to be ordered. After receiving all the data, the Server begins ordering it.
Group by: In views/queries that perform a GROUP BY operation that is not delegated to the source. In this scenario, the Server has to wait until it receives all the rows of the result set to begin this operation. If the size of the data received exceeds the Maximum size in memory of each node, the Server begins swapping to disk the data that has to be ordered.
There are several exceptions under which the Server does not has to swap any data to disk:
When the view/query projects the value of a field
When the view/query projects the result of a “cumulative” function. That is, it projects these functions:
AVG
,COUNT
,SUM
,MAX
orMIN
.
Hash join: when the Server executes a hash join, first it loads all the data from the right branch of the join and then begins processing the data from the left branch. If the size of the data from the right branch exceeds the “Maximum size in memory”, the behavior of the Server changes depending on the join condition:
If the join condition uses the operators =, <, <=, > or >=, the Server sorts the data of both branches by the join attributes, and then switches the join strategy to merge. To be able to sort the data, it swaps data to disk.
If the join condition uses a different operator (e.g.
like
), the Server begins swapping to disk the data from the right branch of the join. After obtaining all the data from the right branch, it begins reading the swapped data block by block to execute the join.
Minus and intersect operations: in terms of memory management and swapping, these operations behave as hash joins. Therefore, the Server swaps data to disk in the same scenarios.
Merge join: merge join is the most efficient join method and its memory footprint is generally very low. However, there are two scenarios where its memory consumption is higher:
When in the rows from the right branch of the join, there are a lot of rows in which the join attributes have the same value.
Or, when the join condition uses a non-equal operator. I.e. instead of =, it uses <, <=, >, >=, like, etc.
In these scenarios, the Server has to keep in memory a lot of rows from the right branch of the join to be able to check if they match with the rows from the left branch. When the amount of data from the right branch exceeds the “Maximum size in memory”, the Server begins to swap to disk the data from the right branch.
When caching data: When the cache is enabled for a view and the Server executes a query that involves caching data of this view, the Server has to store the results in the cache database and at the same time, return these results to the client that executes the query. Storing data in the cache database is usually slower than returning it to the client. Therefore, the Server has to keep in memory the results that has not cached yet. When the size of this data exceeds the “Maximum size in memory”, the Server begins swapping to disk these results and slowly retrieving them to insert them in the cache database.
There is an exception to this: when you add the parameter
'cache_return_query_results' = 'false'
to theCONTEXT
clause of the query. In this case, the Server does not need to swap the results of the query to avoid a memory overflow.
Note
You can configure the use of “swapping” dynamically for a
specific query using the VQL CONTEXT
clause. See more about this in
the section Configuring Swapping Policies of the VQL Guide.
Maximum Size of a Query¶
By default, Virtual DataPort limits the total amount of memory used by a query. To change this limit, change the value on the Maximum query size (MB) box.
The section Memory Management explains how the memory management works in Virtual DataPort and its subsection Limit the Maximum Amount of Memory of a Query focuses on how the limit set in this wizard is enforced.
Automatic Out Of Memory recovery¶
Even with query memory limits in place, there are rare situations where the server can run out of memory. Examples include very high query concurrency or excessive memory consumption by drivers and other external libraries.
When Automatic Out of Memory recovery is enabled, Virtual DataPort will try to anticipate these situations and terminate the costliest queries to reduce memory consumption.
This process is triggered when the server observes that the average memory usage is above the configured threshold during a period of time and, at the same time, it observes that the JVM Garbage Collector has been actively working trying to release memory. Those are the clear symptoms of a system about to throw an OufOfMemory error. Under those circumstances, if this setting is enabled, the server will start to terminate queries, choosing the costliest queries until the server gets back to nominal parameters. Query cost is determined using heuristics based on metrics like query execution time, CPU usage, used swap size, etc… By default, the size in memory of the views involved in the query will not be taken into account, despite it would be an important metric. To take this metric into account, it should be enabled with the following setting:
SET 'com.denodo.vdb.interpreter.execution.memoryUsageCalculation.enabled' = 'true';
Enabling this metric could affect performance on scenarios with very high concurrency of small and quick queries.
Server administrators can configure the heap usage threshold and the size of the time window using the following two settings:
Max used JVM heap (% memory heap). This is the maximum heap memory usage that is considered “normal”. If the heap memory usage exceeds this value during a period of time, Virtual DataPort will trigger the process to terminate the costliest query to reduce memory consumption. The default value is 90% and we do not recommend a lower value. Maybe it can be increased in environments with very large heaps.
Time window to trigger the process (seconds). The number of seconds that the memory usage has to be above the value explained above. Default value is 15 seconds. Increasing this value could increase the odds of OOM errors in situations where the memory usage can grow very quickly.