Workers, CPU and Memory¶
The following properties of the values.yaml
define the characteristics of the Kubernetes cluster nodes:
presto:
  # The cluster has numNodes = numWorkers + 2
  numWorkers: 4
  cpusPerNode: 16
  memoryPerNode: 64
numWorkers: The number of MPP Workers in the Kubernetes cluster.
The recommendation for running the Denodo Embedded MPP is to create a Kubernetes cluster with N + 2 nodes, with no other applications running on those nodes:
- one node for the Coordinator
- one node for the Embedded Hive Metastore and the Embedded PostgreSQL
- N nodes for the Workers
In that case, numWorkers must be N (a worked layout is sketched below).
See also Sizing Recommendations for the Denodo Embedded MPP.
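For instance, with the default numWorkers: 4 from the example above, the N + 2 rule gives a 6-node cluster. The breakdown below is only an illustration of that rule:

numWorkers: 4  ->  numNodes = 4 + 2 = 6
  1 node  - MPP Coordinator
  1 node  - Embedded Hive Metastore + Embedded PostgreSQL
  4 nodes - MPP Workers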
cpusPerNode: The number of CPU cores of each node of the Kubernetes cluster.
The Denodo Embedded MPP requires a certain amount of CPU to process the data for a given query: it is recommended to start with nodes with at least 16-32 cores. For example, in Amazon Elastic Kubernetes Service you can start with m6a.8xlarge or r6a.4xlarge nodes.
In general, if you double the CPU in the cluster while keeping the same memory, a query will take roughly half the time: more CPUs mean shorter queries.
memoryPerNode: The total memory in GB of each node of the Kubernetes cluster. This setting determines the memory available to the MPP Coordinator and each MPP Worker, which will be 80% of the total memory of the cluster node.
The Denodo Embedded MPP requires a certain amount of memory to process queries with JOIN, GROUP BY and ORDER BY operations: it is recommended to start with nodes with at least 64-128 GB of memory. For example, in Amazon Elastic Kubernetes Service you can start with m6a.8xlarge or r6a.4xlarge nodes.
The memory available determines the maximum number of concurrent queries that can run in the Denodo Embedded MPP (a combined example of numWorkers, cpusPerNode and memoryPerNode follows below).
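As a minimal sketch, assuming a cluster built on r6a.4xlarge nodes (16 vCPUs and 128 GiB of memory per node, according to AWS's published instance specifications), these three properties could be set as follows; adjust the numbers to your own node type:

presto:
  numWorkers: 4
  cpusPerNode: 16
  memoryPerNode: 128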
Memory Settings for Queries¶
It is important to adjust the Denodo Embedded MPP memory settings for query performance, finding a balance between the maximum memory per query and the maximum number of concurrent queries that can run.
The maximum amount of user memory a query can use on an MPP Worker is query.max-memory-per-node = JVM max memory * 0.1 by default.
The JVM maximum memory is 80% of the total memory of the cluster node.
Therefore, the memory available for executing a query in an MPP Worker is memoryPerNode * 0.8 * 0.1 by default.
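As a quick worked example, assuming the memoryPerNode: 64 value from the configuration shown at the top of this section:

memoryPerNode              = 64 GB
JVM max memory             = 64 GB * 0.8  = 51.2 GB
query.max-memory-per-node  = 51.2 GB * 0.1 ≈ 5.1 GB of user memory per query, per Worker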
With very large queries the Embedded MPP could throw Query exceeded per-node user memory limit of xGB, Query exceeded per-node total memory limit of xGB or Query exceeded distributed user memory limit of xGB, meaning that it needs more memory to handle those queries.
In this case you can configure the memory settings in the values.yaml to override the default values:
presto:
  coordinator:
    additionalConfig: [
      query.max-memory-per-node=xGB,
      query.max-total-memory-per-node=yGB,
      query.max-memory=zGB
    ]
…
presto:
  worker:
    additionalConfig: [
      query.max-memory-per-node=xGB,
      query.max-total-memory-per-node=yGB,
      query.max-memory=zGB
    ]
The most important memory properties are:

query.max-memory-per-node: the maximum amount of user memory a query can use on an MPP Worker. The default value is JVM max memory * 0.1.
If this is not enough, our recommendation as a starting point is to set query.max-memory-per-node = JVM max memory * 0.5.
Increasing the default query.max-memory-per-node can improve the performance of large queries, but it may also reduce the memory available for other queries in highly concurrent scenarios.

query.max-total-memory-per-node: the maximum amount of user and system memory a query can use on an MPP Worker. The default value is query.max-memory-per-node * 2.
If this is not enough, our recommendation as a starting point is to set query.max-total-memory-per-node = JVM max memory * 0.6.
Increasing the default query.max-total-memory-per-node can improve the performance of large queries, but it may also reduce the memory available for other queries in highly concurrent scenarios.

query.max-memory: the maximum amount of user memory a query can use across all MPP Workers in the cluster. The default value is 20GB.
If the cluster needs to handle bigger queries, you need to increase query.max-memory; our recommendation is query.max-memory = query.max-memory-per-node * numWorkers.
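Purely as an illustration of these starting points, for the hypothetical cluster used earlier (memoryPerNode: 64, which gives a JVM max memory of roughly 51 GB, and numWorkers: 4) the overrides could look like the sketch below; the exact figures have to be derived from your own node size and number of Workers:

presto:
  coordinator:
    additionalConfig: [
      query.max-memory-per-node=25GB,
      query.max-total-memory-per-node=30GB,
      query.max-memory=100GB
    ]
  worker:
    additionalConfig: [
      query.max-memory-per-node=25GB,
      query.max-total-memory-per-node=30GB,
      query.max-memory=100GB
    ]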
You can also run EXPLAIN ANALYZE (or EXPLAIN (TYPE DISTRIBUTED), if EXPLAIN ANALYZE throws any Query exceeded error) on these big queries from an external JDBC client like DBeaver, to examine the query plan and try to optimize it: by gathering statistics for each view involved in the queries, by adding filter conditions to reduce the amount of data to be processed, etc.

In addition to adjusting the memory settings, sometimes the only solution for handling large queries is to use cluster nodes with more memory or to add more nodes to the cluster.
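For reference, a minimal sketch of such a plan inspection from a JDBC client; the table and query below are purely hypothetical:

EXPLAIN ANALYZE
SELECT store_id, SUM(amount)
FROM sales
GROUP BY store_id;

-- If EXPLAIN ANALYZE itself fails with a "Query exceeded ..." error,
-- fall back to the static distributed plan:
EXPLAIN (TYPE DISTRIBUTED)
SELECT store_id, SUM(amount)
FROM sales
GROUP BY store_id;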
CPU and Memory Management in Kubernetes¶
Kubernetes schedules pods across nodes based on their resource requests and limits for CPU and memory. If a pod requests certain CPU and/or memory values, Kubernetes will only schedule it on a node that can guarantee those resources. Limits, on the other hand, ensure that a pod never exceeds a certain value. The resources for the MPP Coordinator and the MPP Workers can be configured in the values.yaml:
presto:
  coordinator:
    resources:
      limits:
        cpu: 12.8
        memory: 56Gi
      requests:
        cpu: 12.8
        memory: 56Gi
  ...
  worker:
    resources:
      limits:
        cpu: 12.8
        memory: 56Gi
      requests:
        cpu: 12.8
        memory: 56Gi
Note that resources
are commented out, as we leave this setting as a choice for the Kubernetes cluster administrator.