USER MANUALS


Query Performance Optimization

You can fine-tune the performance of the Embedded MPP engine to match your specific analytical needs. The following configuration properties are available and can be adjusted based on your workloads to achieve optimal efficiency and speed. They are configurable in values.yaml in presto.config section:

General optimization properties

Property name

Default Value

Description

enableDynamicFiltering

true

This optimization significantly improves the performance of queries with selective joins by avoiding reading data that would be filtered by join condition.

useHistograms

false

Enables the collection of histogram statistics for Iceberg tables. Using histograms provides more accurate row count estimate following filter operations, especially on skewed data. This feature is disabled by default. It should be enabled when managing skewed data to take advantage of the improved statistics.

joinMaxBroadcastTableSize

615MB

This property controls the maximum estimated size of a table that can be broadcast when using AUTOMATIC join distribution type. This allows improving cluster concurrency and prevents bad plans when CBO misestimates size of joined tables.

optimizedRepartitioning

true

Enabling optimized repartitioning should show CPU performance improvement.

partialAggregationStrategy

automatic

With the automatic strategy the optimizer decides for each aggregation, e.g: disable partial aggregation when size reduction is below threshold

generateDomainFilters

true

Infer predicates from column domains during predicate pushdown. These generated predicates can significantly reduce input data scanned, thereby reducing execution cost.

httpServerThreadsMax

1000

Specifies the maximum number of threads available for coordinator node for processing requests.

maxPendingSplitsPerTask

2000

The number of outstanding splits with the standard split weight that can be queued for each worker node for a single stage of a query, even when the node is already at the limit for total number of splits.

maxSplitsPerNode

2000

Limits the number of splits that can be scheduled for each node. Increasing this value will allow the cluster to process more queries or reduce visibility of problems connected to data skew. Excessively high values may result in poor performance due to context switching and higher memory reservation for cluster metadata.

minScheduleSplitBatchSize

2000

Specifies the minimum number of splits to consider for scheduling per batch.

sinkMaxBufferSize

1GB

Increasing this value may improve network throughput for data transferred between stages, if the network has high latency, or if there are many nodes in the cluster.

enableEventLoop

true

This feature improvs efficiency of coordinator when running a large number of tasks.

The following properties are configurable in values.yaml in presto.hive, presto.iceberg and presto.delta sections:

Optimization properties by catalog

Property name

Default Value

Description

parquetBatchReadOptimizationEnabled

true

This optimization enables vectorized readings, which significantly improves performance. There is a separate property for each predefined catalog: hive, iceberg and delta.

pushdownFilterEnabled

true

This optimization makes table scan more efficient. There is a separate property for each predefined catalog: hive, iceberg and delta.

quickStats

true

This optimization builds stats from metadata when statistics are missing, which significantly improves performance. There is a separate property for each predefined catalog: hive, iceberg and delta.

maxOutstandingSplits

1800

Limit on the number of splits waiting to be served by a split source. After reaching this limit, writers will stop writing new splits until some of them are used by workers. Higher values will increase memory usage, but allow IO to be concentrated at one time, which may be faster and increase resource utilization.

maxInitialSplits

600

This property describes how many splits may be initially created for a single query using hive.max-initial-split-size instead of hive.max-split-size. The purpose of the smaller split size for the initial splits is to increase concurrency for smaller queries.

quicmaxInitialSplitSizekStats

128MB

This property describes the maximum size of the first hive.max-initial-splits splits created for a query. Lower values will increase concurrency for small queries. This property represents the maximum size, as the real size may be lower when the amount of data to read is less than hive.max-initial-split-size.

maxSplitSize

256MB

The maximum size of splits created after the initial splits. A higher value will reduce parallelism. This may be desirable for very large queries and a stable cluster because it allows for more efficient processing of local data without the context switching, synchronization and data collection that result from parallelization. The optimal value should be aligned with the average query size in the system.

splitLoaderConcurrency

32

Increase split loader concurrency could lead to some individual queries see 2x, 3x improvement

Add feedback