NOTE: This sizing document applies only to external MPP clusters, not to the Denodo embedded MPP cluster. For sizing of an embedded MPP cluster see Sizing recommendations for the Embedded MPP.

Introduction

This document provides guidelines for estimating the size of a cluster running a SQL-on-Hadoop engine that is used to accelerate queries issued through Denodo. It applies to Denodo 7.0 and newer versions. Denodo offers native integration with the following Hadoop engines: Impala, Presto and Spark.

We tested several scenarios using the queries and datasets provided by TPC-DS, a widely used industry-standard benchmark for decision support systems. In our tests, we used two scale factors for the datasets: 100 GB and 1 TB.

The tests were run in two different environments: a physical cluster and a cloud-based provider (Amazon EMR).

In both scenarios, it is highly recommended to follow the optimization recommendations and guidelines of the Hadoop distribution vendor chosen. For further help on how to configure the MPP query acceleration in Denodo, you can check this configuration guide.

Physical cluster details

Node characteristics:

  • CPU 16 cores Intel(R) Xeon(R) CPU E5-2623 v3 @ 3.00GHz
  • 32 GB memory
  • 2 x 2 TB disks (RAID 1 on the master node)

It is important that Denodo runs on a server that belongs to the same network segment as the cluster, to ensure fast communication between the two.

Amazon EMR general recommendations

We recommend installing Denodo on an instance that has direct network connectivity to the cluster, to avoid a communication bottleneck. By default, EMR does not provide edge nodes, so the easiest option is to install Denodo on the master node. However, to avoid adding load to the master node, we recommend configuring a standalone Amazon EC2 instance as an edge node and installing Denodo on that instance.

By default, Amazon does not provide a graphical interface, so Denodo has to be installed through the command line.

It is important to note that the tests performed in Amazon EMR showed some variability in the query times. This effect was not observed in the tests performed on the physical cluster.
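Because of this variability, a single timing taken on Amazon EMR may not be representative. The following is a minimal sketch of how repeated timings of one query could be summarized; the function name `summarize_timings` and the sample values are hypothetical and are not taken from the tests described here.

```python
import statistics


def summarize_timings(seconds):
    """Summarize repeated wall-clock timings (in seconds) of a single query.

    Returns the median and the coefficient of variation (sample standard
    deviation divided by the mean), a unit-free measure of run-to-run spread.
    """
    if len(seconds) < 2:
        raise ValueError("at least two timings are needed to measure spread")
    mean = statistics.mean(seconds)
    return {
        "median": statistics.median(seconds),
        "cv": statistics.stdev(seconds) / mean,
    }


# Hypothetical timings for illustration only; not measured results.
emr_runs = [10.0, 12.0, 14.0]
summary = summarize_timings(emr_runs)
print(summary)
```

Comparing the median rather than a single run, and checking that the coefficient of variation stays small, is one way to keep noisy runs from skewing a sizing decision.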

Recommended configurations for each engine

Impala

Physical cluster

  • Minimum recommended: 16 nodes.

Amazon EMR

  • The best performance was observed with a cluster of 16 nodes.
  • Recommended node type: m4.4xlarge
  • Adding more nodes to the cluster does not provide better performance when running single queries, but it might improve the response times in scenarios with multiple concurrent queries.

Spark

Physical cluster

  • Minimum recommended: 16 nodes.

Amazon EMR

  • The best performance was observed with a cluster of 32 nodes.
  • Recommended node type: m4.4xlarge
  • Adding more nodes to the cluster does not provide better performance when running single queries, but it might improve the response times in scenarios with multiple concurrent queries.

Presto

Physical cluster

  • Minimum recommended: 16 nodes.

Amazon EMR

  • The best performance was observed with a cluster of 16 nodes.
  • Recommended node type: m4.4xlarge
  • Adding more nodes to the cluster does not provide better performance when running single queries, but it might improve the response times in scenarios with multiple concurrent queries.
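
The recommendations above can be collected into a small lookup table, for example for use in a provisioning script. This is only a sketch: the node counts and the instance type come from this document, while the names `RECOMMENDED_NODES`, `EMR_NODE_TYPE` and `recommended_nodes` are hypothetical and not part of any Denodo or AWS API.

```python
# Node counts from the recommendations in this document.
# "physical" is the minimum recommended number of nodes;
# "emr" is the cluster size at which the best performance was observed.
RECOMMENDED_NODES = {
    "impala": {"physical": 16, "emr": 16},
    "spark": {"physical": 16, "emr": 32},
    "presto": {"physical": 16, "emr": 16},
}

# Instance type recommended for all three engines on Amazon EMR.
EMR_NODE_TYPE = "m4.4xlarge"


def recommended_nodes(engine, environment):
    """Return the node count for an engine ("impala", "spark", "presto")
    in an environment ("physical" or "emr"). Raises KeyError if unknown."""
    return RECOMMENDED_NODES[engine.lower()][environment.lower()]


print(recommended_nodes("Spark", "emr"))
```

These values are starting points for single-query performance; workloads with many concurrent queries may benefit from more nodes, as noted above.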

Disclaimer
The information provided in the Denodo Knowledge Base is intended to assist our users in advanced uses of Denodo. Please note that the results from the application of processes and configurations detailed in these documents may vary depending on your specific environment. Use them at your own discretion.
For an official guide to supported features, please refer to the User Manuals. For questions on critical systems or complex environments, we recommend that you contact your Denodo Customer Success Manager.