USER MANUALS


Sizing Recommendations for the Denodo Lakehouse Accelerator

The size and resources available for the Denodo Lakehouse Accelerator cluster can have a big impact on the overall performance so it is important to choose the best configuration for your scenario. Optimal sizing can vary a lot depending on the characteristics of your workloads. Therefore, we recommend you to perform your own sizing tests using data and queries as close as possible to your real scenario. In order to do so, consider this:

  • If you are configuring N Lakehouse Accelerator workers, create your cluster with N+2 nodes so each worker is placed on one node and there is extra room for your coordinator and your metastore.

  • In general, adding more workers will make your queries run faster as those operations that the Denodo Lakehouse Accelerator can run in parallel will be distributed among more nodes. For instance, reading Parquet files from a table.

  • Increasing the number of CPU cores will also speed-up queries, specially under concurrency.

  • For the previous considerations, make sure the total number of CPU cores of your Denodo Lakehouse Accelerator cluster does not reach the maximum allowed by your Denodo license.

  • Ensure that your cluster has abundant memory. It is important to make sure there is enough memory available to avoid the Denodo Lakehouse Accelerator from spilling data to disk or crashing under concurrency.

The recommendations below are based on our own sizing tests using TPC-DS benchmark at scale factor 100. Therefore, they should be informative for workloads composed of analytical queries working on hundreds of millions of rows or higher:

  • Start with 8 workers or higher running on nodes with at least 128GB of memory and 16-32 cores. For example, in Amazon Web Services you can start with m6a.8xlarge or r6a.4xlarge nodes.

  • If you experience crashes or poor performance try doubling the size of your cluster. With queries managing hundreds of millions of rows, even isolated queries with no concurrency will typically run much faster than with smaller clusters. With concurrency, smaller clusters will probably not be able to provide adequate performance. When you double the number of workers in your cluster, you can expect to see:

    • With no concurrency: better performance for most queries, but not dramatic improvements. In many cases you will see 5-15% improvements.

    • Moderate concurrency (e.g. sustained load of 5 concurrent queries): average execution times can decrease between 30-40% as you double the number of workers, but the effect will weaken with every increase.

    • Medium concurrency (e.g. 10-15 sustained concurrent queries). The effects of doubling the number of workers will be stronger and more sustained. With 8 workers (128GB and 32 cores each), a significant number of queries can start to take minutes, so you should probably use at least 16 workers.

    • Higher concurrency: start with 32 workers (128GB and 32 cores each) and scale your cluster accordingly with the expected workload.

  • When increasing the size of the cluster does not improve your performance try using nodes with more memory and CPU. Take into account that some operations cannot run in parallel (like an aggregation consolidating distributed results) and may require more resources specially under concurrency. For instance, in AWS it’s typically a good idea to use memory optimized nodes, like rx8large, instead of general purpose nodes, like mx8large. That said, memory optimized nodes can be significantly more expensive and the effects can vary a lot for different workloads, so you may want to perform specific tests for your scenario. In general, the effect will increase as you increase concurrency but it can vary a lot depending on the query, from small improvements to executions running 3 times faster with moderate-high concurrency.

  • Once you have achieved the desired performance you can configure your cluster to use autoscaling to make sure you only use the resources you need:

Note

If increasing the size of the cluster and the resources of the cluster nodes does not improve your performance check the latency of the network between Denodo and the Lakehouse Accelerator cluster and between the Lakehouse Accelerator cluster and the object storage you are accessing.

Sizing recommendations for each Denodo environment

Each Denodo environment should have its own Denodo Lakehouse Accelerator cluster configured (see section Promoting the Denodo Lakehouse Accelerator for more information). The specifications for each environment differ because each has unique load patterns and distinct performance requirements. To adjust to each environment and avoid unnecessary costs, consider the following recommendations:

  • Production environments: For production, follow the sizing recommendations described before to ensure the best performance.

  • Staging environments: The Staging cluster should be similar in size to the production environment but can be stopped most of the time, as it’s only required during testing before a promotion.

  • Development evironments: Dev clusters can be smaller and use more cost-effective instances since performance requirements are not as demanding as in production. In addition, it can be stopped during non-working hours, and it’s often possible to configure autoscaling with a low minimum.

Finally, for those scenarios that use the Denodo Lakehouse Accelerator to accelerate queries only with its MPP engine, and not for caching or to access data in an object storage, the development environment can work without configuring a Denodo Lakehouse Accelerator cluster for it, as the only effect would be on the performance of some queries. In this sense, it is not strictly necessary to configure a cluster for the development environment. However, we recommend allocating at least a small cluster to test its configuration and query executions before promoting to a higher environment.

Add feedback