Query Performance Optimization¶
You can fine-tune the performance of the Embedded MPP engine to match your specific analytical needs. The following configuration properties can be adjusted to suit your workloads and achieve optimal efficiency and speed. They are configurable in the presto.config section of values.yaml:
| Property name | Default Value | Description |
|---|---|---|
| | true | This optimization significantly improves the performance of queries with selective joins by avoiding reading data that would be filtered out by the join condition. |
| | false | Enables the collection of histogram statistics for Iceberg tables. Histograms provide more accurate row count estimates after filter operations, especially on skewed data. This feature is disabled by default; enable it when managing skewed data to take advantage of the improved statistics. |
| | 615MB | Controls the maximum estimated size of a table that can be broadcast when using the AUTOMATIC join distribution type. This improves cluster concurrency and prevents bad plans when the cost-based optimizer (CBO) misestimates the size of joined tables. |
| | true | Enabling optimized repartitioning should improve CPU performance. |
| | automatic | With the automatic strategy, the optimizer decides for each aggregation whether to apply partial aggregation; for example, partial aggregation is disabled when the size reduction falls below a threshold. |
| | true | Infers predicates from column domains during predicate pushdown. These generated predicates can significantly reduce the input data scanned, thereby reducing execution cost. |
| | 1000 | Specifies the maximum number of threads available to the coordinator node for processing requests. |
| | 2000 | The number of outstanding splits with the standard split weight that can be queued for each worker node for a single stage of a query, even when the node is already at the limit for the total number of splits. |
| | 2000 | Limits the number of splits that can be scheduled on each node. Increasing this value allows the cluster to process more queries and reduces the impact of data skew. Excessively high values may result in poor performance due to context switching and higher memory reservation for cluster metadata. |
| | 2000 | Specifies the minimum number of splits to consider for scheduling per batch. |
| | 1GB | Increasing this value may improve network throughput for data transferred between stages if the network has high latency or if there are many nodes in the cluster. |
| | true | Improves the efficiency of the coordinator when running a large number of tasks. |
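As a sketch, a property from the table above can be overridden in values.yaml under presto.config. The key example.property and the exact nesting shown here are illustrative assumptions; substitute the actual property name and follow your chart's layout:

```yaml
presto:
  config:
    # Hypothetical key for illustration only; replace it with the
    # property you want to tune and its desired value.
    example.property: "true"
```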
The following properties are configurable in values.yaml in the presto.hive, presto.iceberg, and presto.delta sections:
| Property name | Default Value | Description |
|---|---|---|
| | true | Enables vectorized reading, which significantly improves performance. There is a separate property for each predefined catalog: hive, iceberg, and delta. |
| | true | Makes table scans more efficient. There is a separate property for each predefined catalog: hive, iceberg, and delta. |
| | true | Builds statistics from metadata when table statistics are missing, which significantly improves performance. There is a separate property for each predefined catalog: hive, iceberg, and delta. |
| | 1800 | Limits the number of splits waiting to be served by a split source. After reaching this limit, writers stop producing new splits until some of them are consumed by workers. Higher values increase memory usage but allow IO to be concentrated at one time, which may be faster and increase resource utilization. |
| | 600 | Specifies how many splits may initially be created for a single query using hive.max-initial-split-size instead of hive.max-split-size. The purpose of the smaller split size for the initial splits is to increase concurrency for smaller queries. |
| | 128MB | The maximum size of the first hive.max-initial-splits splits created for a query. Lower values increase concurrency for small queries. This is an upper bound: the actual size may be lower when the amount of data to read is less than hive.max-initial-split-size. |
| | 256MB | The maximum size of splits created after the initial splits. A higher value reduces parallelism. This may be desirable for very large queries on a stable cluster, because it allows more efficient processing of local data without the context switching, synchronization, and data collection that result from parallelization. The optimal value should be aligned with the average query size in the system. |
| | 32 | Increasing split loader concurrency can yield a 2x-3x improvement for some individual queries. |
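For example, the split-size properties referenced in the descriptions above could be tuned for the hive catalog as sketched below. The nesting under presto.hive is an assumption based on the sections named on this page, and analogous keys would go under presto.iceberg or presto.delta:

```yaml
presto:
  hive:
    # Use more, smaller initial splits to raise concurrency for short queries
    # (property names as referenced in the table descriptions above).
    hive.max-initial-splits: 600
    hive.max-initial-split-size: 128MB
    hive.max-split-size: 256MB
```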
