Embedded Hive Metastore

Denodo Embedded MPP is shipped with an Embedded Hive Metastore that acts as a repository of metadata, mapping Parquet files, Iceberg tables and Delta Lake tables stored in S3, ADLS, GCS or HDFS to Hive, Iceberg and Delta Lake tables.

Three predefined catalogs access this Embedded Hive Metastore:

  • the hive catalog reads and writes Hive tables of Parquet files registered in the Embedded Hive Metastore.

  • the delta catalog reads Delta Lake tables registered in the Embedded Hive Metastore.

  • the iceberg catalog reads and writes Iceberg tables registered in the Embedded Hive Metastore.

Note

hive, delta and iceberg are used only by Denodo to create tables as a result of graphical exploration of datasets on the From object storage tab of the Embedded MPP data source.

Therefore, hive, delta and iceberg are restricted catalogs and are not listed on the From MPP Catalogs tab of the Embedded MPP data source.

Supported Operations by Format

Operation       Hive       Iceberg      Delta

Read            Yes        Yes          Yes

Create/Insert   Yes (*)    Yes (**)     No

Update          No         Yes (***)    No

Delete          No         Yes (****)   No

(*) To support write operations, make sure that Hive-type catalogs have the following property:

hive.non-managed-table-writes-enabled=true
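
As an illustrative sketch, assuming a Hive-type catalog defined in a hive.properties file (the catalog file name, connector name and metastore URI below are assumptions, not the shipped configuration), the property would be set like this:

# hive.properties - hypothetical Hive-type catalog definition
connector.name=hive
hive.metastore.uri=thrift://hive-metastore:9083
# allow CREATE TABLE / INSERT on non-managed (external) Hive tables
hive.non-managed-table-writes-enabled=true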

(**) To support write operations in Iceberg catalogs other than iceberg, make sure that VDP is configured as follows:

SET 'com.denodo.util.jdbc.inspector.impl.PrestoJDBCInspector.iceberg.catalogNames' = 'iceberg, other_iceberg, another_iceberg';

(***) Update operations are supported from Denodo 9.3.0 onwards. Iceberg table updates require format version 2 or higher, and the table update mode must be merge-on-read.
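
As an example, assuming a hypothetical Iceberg v2 table analytics.sales configured with merge-on-read update mode, an update issued through the iceberg catalog would look like this:

UPDATE iceberg.analytics.sales
SET status = 'shipped'
WHERE order_id = 1001;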

(****) Delete operations are supported from Denodo 9.2.0 onwards. For V1 tables, the Iceberg connector can only delete data in one or more entire partitions: all columns in the filter must be identity-transformed partition columns of the target table.
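
For example, with a hypothetical V1 table analytics.events partitioned by an identity transform on event_date, only partition-wide deletes such as this one are allowed:

DELETE FROM iceberg.analytics.events
WHERE event_date = DATE '2024-01-01';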

Kerberos

If the dataset is located in an HDFS secured with Kerberos, you need to configure the Kerberos-related properties:

  • You need to place the keytab file in the hive-metastore/secrets folder.

  • Add the Hadoop properties related to Kerberos to the hive-metastore/conf/core-site.xml. The following is just an example; it may be necessary to add extra properties.

    Kerberos configuration in core-site.xml
     <property>
       <name>hadoop.security.authorization</name>
       <value>true</value>
     </property>
    
     <property>
       <name>hadoop.security.authentication</name>
       <value>kerberos</value>
     </property>
    
     <property>
       <name>hadoop.http.authentication.kerberos.keytab</name>
       <value>/opt/secrets/xxxx.keytab</value>
     </property>
    
     <property>
       <name>dfs.datanode.kerberos.principal</name>
       <value>hdfs/xxxxx@YYYYYYY</value>
     </property>
    

    This way the Embedded Hive Metastore connects to HDFS as the Kerberos principal specified in dfs.datanode.kerberos.principal, using the keytab specified in hadoop.http.authentication.kerberos.keytab.

  • Place the krb5.conf file in hive-metastore/conf.
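
    Example krb5.conf (illustrative; the realm and KDC host are placeholders, use the values of your Kerberos deployment)
    [libdefaults]
      default_realm = YYYYYYY

    [realms]
      YYYYYYY = {
        kdc = kdc.example.com
      }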

  • Add the following volumeMount to the additionalVolumeMounts property for metastore in the values.yaml.

    Kerberos volumeMount in the additionalVolumeMounts property for metastore in the values.yaml
    additionalVolumeMounts:
      - name: hive-metastore-vol
        mountPath: /etc/krb5.conf
        subPath: krb5.conf
    
  • Replace the command in templates/hive-metastore-template.yaml with:

    Kerberos command in hive-metastore-template.yaml
      command: ['sh', '-c', "kinit -k -t /opt/secrets/xxxx.keytab xxxx@YYYY; /opt/run-hive-metastore.sh"]
    

Important

The Kerberos ticket of the Embedded Hive Metastore needs to be renewed periodically. You can automatically run the kinit -k command by setting up a cron job.
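
For example, a crontab entry like the following (using the same keytab and principal placeholders as above) renews the Kerberos ticket every 8 hours:

0 */8 * * * kinit -k -t /opt/secrets/xxxx.keytab xxxx@YYYY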

Embedded Database

The Embedded Hive Metastore stores the metadata in an Embedded PostgreSQL.

The metastore section of the values.yaml configures the connection to the Embedded PostgreSQL.

metastore:
  enabled: true
  connectionUrl: "jdbc:postgresql://postgresql:5432/metastore"
  connectionDriverName: "org.postgresql.Driver"
  connectionDatabase: "metastore"
  connectionUser: "hive"
  connectionPassword: "hive"

postgresql:
  enabled: true

External Database

You can also choose to use an alternative external database (PostgreSQL, MySQL, SQL Server or Oracle) to work with the Embedded Hive Metastore. The externally managed database option has the advantage of keeping the metadata outside the cluster lifecycle. In some cases it is the only option, for example when there are policies restricting the type of RDBMS that can be installed, its backups, its maintenance, etc.

To configure an external database, fill in the metastore.connectionXXX parameters with the connection details. Make sure that the external database can be accessed from the Embedded MPP cluster, and do not forget to disable the Embedded PostgreSQL with postgresql.enabled=false so that it is not deployed.

metastore:
  enabled: true
  connectionUrl: "jdbc:sqlserver://xxxx.database.windows.net:1433;..."
  connectionDriverName: "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  connectionDatabase: "metastore"
  connectionUser: "user@DOMAIN"
  connectionPassword: "mypassword"

postgresql:
  enabled: false

  • connectionUrl: JDBC connection string for the database of the Embedded Hive Metastore, which can be:

    • the Embedded PostgreSQL, jdbc:postgresql://postgresql:5432/metastore, the default one

    • an external PostgreSQL

    • an external MySQL

    • an external SQL Server

    • an external Oracle

  • connectionDriverName: JDBC Driver class name to connect to the database of the Embedded Hive Metastore, which can be:

    • org.postgresql.Driver for PostgreSQL, the default one

    • org.mariadb.jdbc.Driver for MySQL

    • com.microsoft.sqlserver.jdbc.SQLServerDriver for SQL Server

    • oracle.jdbc.OracleDriver for Oracle
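
For reference, typical connectionUrl formats for these databases look like the following; the hosts, ports and database names are illustrative and must be replaced with your own values:

# PostgreSQL
jdbc:postgresql://mypostgres.example.com:5432/metastore
# MySQL (through the MariaDB driver)
jdbc:mariadb://mymysql.example.com:3306/metastore
# SQL Server
jdbc:sqlserver://mysqlserver.example.com:1433;databaseName=metastore
# Oracle
jdbc:oracle:thin:@//myoracle.example.com:1521/ORCL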

The Hive Metastore heap size is set to 2048 MB by default, but it can be configured via values.yaml according to your needs.

metastore:
  maxHeapSize: 2048

In addition, hive-metastore/scripts includes an initialization script for the external databases (PostgreSQL, MySQL, SQL Server or Oracle) that must be run on the external database before deploying the Denodo Embedded MPP.
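
For instance, for a PostgreSQL external database the script could be run like this; the host, credentials and script file name are placeholders, so check the actual script name in hive-metastore/scripts:

psql -h mypostgres.example.com -p 5432 -U hive -d metastore \
  -f hive-metastore/scripts/<postgresql-init-script>.sql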

Supported Databases for Embedded Hive Metastore

Database        Minimum supported version

Postgres        9.1.13

MySQL           5.6.17

MS SQL Server   2008 R2

Oracle          11g

CPU and Memory Management in Kubernetes

Kubernetes uses resource requests and resource limits to efficiently schedule pods across the cluster nodes.

  • Resource Requests: This specifies the minimum amount of a resource (CPU or Memory) that a container requires to function correctly. The Kubernetes scheduler will only place a Denodo Embedded MPP pod on a node that can guarantee the availability of the requested resources.

  • Resource Limits: This specifies the maximum amount of a resource (CPU or Memory) that a container is allowed to consume. Limits prevent a single pod from consuming all available resources on a node.

    • CPU Limits: If a pod tries to use more CPU than its limit, Kubernetes will throttle its CPU usage.

    • Memory Limits: If a pod tries to use more memory than its limit, Kubernetes will terminate (kill) the pod to prevent it from impacting the node. This often results in an “Out-Of-Memory” (OOMKilled) error.
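
      As an example, whether a metastore pod was terminated for this reason can be checked with kubectl; the pod name below is illustrative:

      kubectl describe pod hive-metastore-0 | grep -A 3 "Last State"
      # Reason: OOMKilled means the container exceeded its memory limit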

The CPU and Memory resource requests and limits for the Denodo Embedded Hive Metastore pod can be configured within the metastore section of the values.yaml file:

metastore:
  resources:
    limits:
      cpu: 1
      memory: 2Gi
    requests:
      cpu: 1
      memory: 2Gi

  • CPU units:

    • 1.0 represents one full CPU core (or vCPU in cloud environments).

    • 0.1 or 100m (100 millicores) represents one-tenth of a CPU core.

  • Memory units:

    • Gi (Gibibytes) is the standard Kubernetes unit for memory. 1Gi = 1024Mi.

Notice that the resources section for metastore is commented out by default in the provided values.yaml. We leave these settings as a choice for the Kubernetes cluster administrator, as the optimal CPU and Memory values are highly dependent on the instance types of the Kubernetes nodes, the workload patterns for the Denodo Embedded MPP, etc.
