This document provides guidelines for sizing a cluster that runs a SQL-on-Hadoop engine used to accelerate queries issued through Denodo 7.0 and newer versions. Denodo offers native integration with the following Hadoop engines: Impala, Presto and Spark.
We tested several scenarios using the queries and datasets provided by TPC-DS, an industry-standard benchmark for decision support systems. In our tests, we used two scale factors of the dataset: 100 GB and 1 TB.
The tests were run in two different scenarios: a physical cluster and a cloud-based cluster (Amazon EMR).
In both scenarios, it is highly recommended to follow the optimization recommendations and guidelines of the chosen Hadoop distribution vendor. For further help on configuring MPP query acceleration in Denodo, see this configuration guide.
It is important that Denodo runs on a server in the same network segment as the cluster, to ensure fast communication between them.
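One rough way to check how close the Denodo server is to the cluster is to time a few TCP connections from the Denodo host to the engine's coordinator. The sketch below is only an illustration using the Python standard library; the function name `tcp_connect_times` is hypothetical, and the host and port must be whatever your engine actually listens on.

```python
import socket
import time


def tcp_connect_times(host, port, attempts=5, timeout=3.0):
    """Return the time in milliseconds taken by each TCP connect to host:port."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # create_connection raises OSError if the port cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.perf_counter() - start
        samples.append(elapsed * 1000.0)
    return samples
```

For example, `tcp_connect_times("coordinator.example.internal", 21050)` would time five connections to a placeholder host on port 21050, which is assumed here to be the engine's JDBC port; adjust both values for your environment. Connect time is only a proxy for round-trip latency and says nothing about throughput, so treat the result as a sanity check rather than a benchmark.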
Denodo should be installed on an instance that has direct network connectivity to the cluster, to avoid a communication bottleneck. By default, EMR does not provide edge nodes, so the easiest option is to install Denodo on the master node. However, to avoid adding load to the master node, we recommend configuring a standalone Amazon EC2 instance as an edge node and installing Denodo on that instance.
By default, these Amazon instances do not provide a graphical interface, so Denodo has to be installed through the command line.
It is important to note that the tests performed in Amazon EMR showed some variability in the query times. This effect was not observed in the tests performed on the physical cluster.
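When repeating a query several times to judge this variability in your own environment, a simple summary such as the coefficient of variation (standard deviation divided by mean) makes runs on different clusters comparable. The following is a minimal sketch, not the method used in our tests; the function name `summarize_timings` is hypothetical.

```python
import statistics


def summarize_timings(seconds):
    """Summarize repeated execution times of one query (in seconds)."""
    if len(seconds) < 2:
        raise ValueError("need at least two runs to measure variability")
    mean = statistics.mean(seconds)
    stdev = statistics.stdev(seconds)  # sample standard deviation
    return {
        "runs": len(seconds),
        "mean": mean,
        "stdev": stdev,
        "cv": stdev / mean,  # coefficient of variation, unitless
    }
```

Pass it the list of wall-clock times collected for the same query, one entry per run. A higher `cv` on one cluster than on another means that its timings are more spread out relative to their average.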