MPP Query Acceleration: Sizing guidelines

Applies to: Denodo 8.0, Denodo 7.0
Last modified on: 26 Jun 2020
Tags: Best practices, MPP, Optimization, Performance

Introduction

This document provides guidelines for sizing a cluster running a SQL-on-Hadoop engine that is used to accelerate queries issued through Denodo 7.0 and newer versions. Denodo offers native integration with the following Hadoop engines: Impala, Presto and Spark.

We tested several scenarios using the queries and datasets provided by TPC-DS, a widely used industry-standard benchmark for decision support systems. In our tests, we used two scale factors of the datasets: 100 GB and 1 TB.

The tests were run in two environments: a physical cluster and a cloud-based provider (Amazon EMR).

In both environments, we highly recommend following the optimization recommendations and guidelines of the chosen Hadoop distribution vendor. For further help on how to configure MPP query acceleration in Denodo, you can check this configuration guide.

Physical cluster details

Node characteristics:

  • CPU: 16 cores, Intel(R) Xeon(R) CPU E5-2623 v3 @ 3.00GHz
  • Memory: 32 GB
  • Disks: 2 x 2 TB (RAID 1 on the master node)
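
As a rough aid when comparing a candidate cluster with the test environment, the aggregate capacity of a cluster built from nodes like the one above can be computed as follows. This is a minimal sketch: the names NodeSpec and cluster_totals are illustrative, the node count of 16 is taken from the minimum recommended later in this document, and the disk figure is raw capacity that ignores RAID and HDFS replication overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hardware of a single node (illustrative type, not part of Denodo)."""
    cores: int
    memory_gb: int
    disks: int
    disk_tb: int


def cluster_totals(node: NodeSpec, nodes: int) -> dict:
    """Return aggregate raw resources for `nodes` identical nodes."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return {
        "cores": node.cores * nodes,
        "memory_gb": node.memory_gb * nodes,
        # Raw capacity only: RAID 1 and HDFS replication reduce usable space.
        "raw_disk_tb": node.disks * node.disk_tb * nodes,
    }


# Node used in the physical cluster tests described above.
TEST_NODE = NodeSpec(cores=16, memory_gb=32, disks=2, disk_tb=2)

if __name__ == "__main__":
    print(cluster_totals(TEST_NODE, nodes=16))
```

These totals are only a starting point for comparison. They do not predict query performance on different hardware.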

It is important that Denodo runs on a server in the same network segment as the cluster, to ensure fast communication between them.

Amazon EMR general recommendations

We recommend installing Denodo on an instance that has direct network connectivity to the cluster, to avoid a communication bottleneck. By default, EMR does not provide edge nodes, so the easiest option is to install Denodo on the master node. However, to avoid adding load to the master node, we recommend configuring a standalone Amazon EC2 instance as an edge node and installing Denodo there.

By default, these Amazon instances do not provide a graphical interface, so Denodo has to be installed through the command line.

Note that the tests performed on Amazon EMR showed some variability in query times. This effect was not observed in the tests performed on the physical cluster.
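
Because of this variability, a single timing on EMR may not be representative. One way to quantify it is to repeat each query several times and look at the spread. The sketch below is an assumption about how this could be done, not the procedure used in our tests: run_query stands for any hypothetical callable that executes one query through Denodo (for example via a JDBC or ODBC client) and returns when the result has been fully read.

```python
import statistics
import time
from typing import Callable, Dict, List


def summarize(timings: List[float]) -> Dict[str, float]:
    """Summarize a list of elapsed times (seconds) from repeated runs."""
    if len(timings) < 2:
        raise ValueError("need at least two timings to estimate the spread")
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)  # sample standard deviation
    return {
        "median_s": statistics.median(timings),
        "mean_s": mean,
        "stdev_s": stdev,
        # Coefficient of variation: spread relative to the mean.
        "cv": stdev / mean if mean else 0.0,
    }


def measure_variability(run_query: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Run `run_query` several times and summarize the wall-clock times."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical: executes one query and reads all rows
        timings.append(time.perf_counter() - start)
    return summarize(timings)


if __name__ == "__main__":
    # Stand-in workload; replace with a real query call.
    print(measure_variability(lambda: time.sleep(0.01), runs=3))
```

Reporting the median together with the coefficient of variation makes it easier to compare EMR runs against the more stable physical cluster.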

Recommended configurations for each engine

Impala

Physical cluster

  • Minimum recommended: 16 nodes.

Amazon EMR

  • The best performance was observed with a cluster of 16 nodes.
  • Recommended node type: m4.4xlarge

  • Adding more nodes to the cluster does not provide better performance when running single queries, but it might improve the response times in scenarios with multiple concurrent queries.

Spark

Physical cluster

  • Minimum recommended: 16 nodes.

Amazon EMR

  • The best performance was observed with a cluster of 32 nodes.
  • Recommended node type: m4.4xlarge

  • Adding more nodes to the cluster does not provide better performance when running single queries, but it might improve the response times in scenarios with multiple concurrent queries.

Presto

Physical cluster

  • Minimum recommended: 16 nodes.

Amazon EMR

  • The best performance was observed with a cluster of 16 nodes.
  • Recommended node type: m4.4xlarge

  • Adding more nodes to the cluster does not provide better performance when running single queries, but it might improve the response times in scenarios with multiple concurrent queries.
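
The recommendations for the three engines can be collected into a small lookup table, for example to drive a provisioning script. This is only a sketch: the names RECOMMENDATIONS and recommended_cluster are illustrative and not part of any Denodo or AWS API. Note that the physical figures are minimums, while the EMR figures are the cluster sizes at which the best performance was observed.

```python
# (engine, environment) -> recommendation taken from this document.
RECOMMENDATIONS = {
    ("impala", "physical"): {"nodes": 16, "basis": "minimum recommended", "node_type": None},
    ("impala", "emr"): {"nodes": 16, "basis": "best observed", "node_type": "m4.4xlarge"},
    ("spark", "physical"): {"nodes": 16, "basis": "minimum recommended", "node_type": None},
    ("spark", "emr"): {"nodes": 32, "basis": "best observed", "node_type": "m4.4xlarge"},
    ("presto", "physical"): {"nodes": 16, "basis": "minimum recommended", "node_type": None},
    ("presto", "emr"): {"nodes": 16, "basis": "best observed", "node_type": "m4.4xlarge"},
}


def recommended_cluster(engine: str, environment: str) -> dict:
    """Look up the recommendation for an engine and environment (case-insensitive)."""
    key = (engine.strip().lower(), environment.strip().lower())
    if key not in RECOMMENDATIONS:
        raise KeyError(f"no recommendation for engine={engine!r}, environment={environment!r}")
    return RECOMMENDATIONS[key]


if __name__ == "__main__":
    for (engine, env), rec in RECOMMENDATIONS.items():
        print(f"{engine:<7} {env:<9} {rec['nodes']:>2} nodes ({rec['basis']})")
```

For workloads with many concurrent queries, treat these numbers as a starting point, since adding nodes might improve response times in that case.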

        
