Embedded Hive Metastore

Denodo Lakehouse Accelerator (formerly known as Denodo Embedded MPP) is shipped with an Embedded Hive Metastore that acts as a repository of metadata mapping Parquet files, Iceberg tables or Delta Lake tables - from S3, ADLS, GCS or HDFS - to Hive, Iceberg or Delta Lake tables.

Three predefined catalogs access this Embedded Hive Metastore:

  • the hive catalog reads and writes Hive tables of Parquet files from the Embedded Hive Metastore.

  • the delta catalog reads Delta Lake tables from the Embedded Hive Metastore.

  • the iceberg catalog reads and writes Iceberg tables from the Embedded Hive Metastore.

Note

The hive, delta and iceberg catalogs are used only by Denodo to create tables resulting from the graphical exploration of datasets on the From Object Storage Routes tab of the Denodo Lakehouse Accelerator data source ‘embedded_mpp’, located in the virtual database ‘admin_denodo_mpp’.

Therefore, hive, delta and iceberg are restricted catalogs and are not listed on the External Catalogs tab of the Denodo Lakehouse Accelerator data source.
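For instance, a table registered in one of these catalogs can be queried by qualifying it with the catalog name. This is a hypothetical sketch; the schema (sales) and table (orders) names are placeholders:

```sql
-- Hypothetical example: reading an Iceberg table registered in the
-- Embedded Hive Metastore through the predefined iceberg catalog.
SELECT order_id, amount
FROM iceberg.sales.orders
WHERE order_date >= DATE '2024-01-01';
```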

Supported Operations by Format

Operation       | Hive     | Iceberg      | Delta
--------------- | -------- | ------------ | -----
Read            | Yes      | Yes          | Yes
Create/Insert   | Yes (*)  | Yes (**)     | No
Update          | No       | Yes (***)    | No
Merge           | No       | Yes (****)   | No
Delete          | No       | Yes (*****)  | No

(*) To support write operations, make sure that Hive type catalogs have the following property:

hive.non-managed-table-writes-enabled=true

(**) To support write operations in Iceberg catalogs other than iceberg, make sure that VDP is configured as follows:

SET 'com.denodo.util.jdbc.inspector.impl.PrestoJDBCInspector.iceberg.catalogNames' = 'iceberg, other_iceberg, another_iceberg';

(***) Iceberg table updates require at least format version 2, and the update mode must be merge-on-read.
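As a sketch (the schema, table and column names are hypothetical, and the exact table property name may vary by connector version), a table that satisfies these requirements can be created and then updated as follows:

```sql
-- Hypothetical example: an updatable Iceberg table must use
-- format version 2; in Trino/Presto-style syntax this is requested
-- with the format_version table property.
CREATE TABLE iceberg.sales.orders (
    order_id BIGINT,
    status   VARCHAR
)
WITH (
    format_version = 2
);

UPDATE iceberg.sales.orders
SET status = 'shipped'
WHERE order_id = 1001;
```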

(****) Iceberg table merges require at least format version 2, and the write.update.mode table property must be merge-on-read. Iceberg tables do not support running multiple MERGE statements on the same table in parallel. If two or more MERGE operations are executed concurrently on the same Iceberg table:

  • The first operation to complete will succeed.

  • Subsequent operations will fail due to conflicting writes and will return the following error:

   Failed to commit Iceberg update to table: <table name>
   Found conflicting files that can contain records matching true
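A merge of this kind might look like the following sketch (the schema, table and column names are hypothetical); remember that concurrent MERGE statements on the same target table will conflict:

```sql
-- Hypothetical MERGE upserting staged rows into an Iceberg table.
MERGE INTO iceberg.sales.orders AS t
USING iceberg.sales.orders_staging AS s
    ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET status = s.status
WHEN NOT MATCHED THEN
    INSERT (order_id, status) VALUES (s.order_id, s.status);
```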

(*****) For V1 tables, the Iceberg connector can only delete data in one or more entire partitions. All columns in the filter must be identity-transformed partition columns of the target table.
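For example, assuming a hypothetical V1 table partitioned by an identity-transformed region column, a whole-partition delete would look like:

```sql
-- Hypothetical V1 table partitioned by region (identity transform).
-- On V1 tables the filter may reference only identity partition
-- columns, so this deletes the entire EMEA partition.
DELETE FROM iceberg.sales.orders_v1
WHERE region = 'EMEA';
```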

Kerberos

If the dataset is located in an HDFS secured with Kerberos, you need to configure the Kerberos-related properties:

  • You need to place the keytab file in the hive-metastore/secrets folder.

  • Add the Hadoop properties related to Kerberos in the hive-metastore/conf/core-site.xml. This is just an example; it may be necessary to add extra properties.

    Kerberos configuration in core-site.xml
     <property>
       <name>hadoop.security.authorization</name>
       <value>true</value>
     </property>
    
     <property>
       <name>hadoop.security.authentication</name>
       <value>kerberos</value>
     </property>
    
     <property>
       <name>hadoop.http.authentication.kerberos.keytab</name>
       <value>/opt/secrets/xxxx.keytab</value>
     </property>
    
     <property>
       <name>dfs.datanode.kerberos.principal</name>
       <value>hdfs/xxxxx@YYYYYYY</value>
     </property>
    

    This way the Embedded Hive Metastore connects to HDFS as the Kerberos principal dfs.datanode.kerberos.principal, using the keytab hadoop.http.authentication.kerberos.keytab.

  • Place the krb5.conf file in hive-metastore/conf.

  • Add the following volumeMount to the additionalVolumeMounts property for metastore in the values.yaml.

    Kerberos volumeMount in the additionalVolumeMounts property for metastore in the values.yaml
    additionalVolumeMounts:
      - name: hive-metastore-vol
        mountPath: /etc/krb5.conf
        subPath: krb5.conf
    
  • Replace the command in templates/hive-metastore-template.yaml with:

    Kerberos command in hive-metastore-template.yaml
      command: ['sh', '-c', "kinit -k -t /opt/secrets/xxxx.keytab xxxx@YYYY; /opt/run-hive-metastore.sh"]
    

Important

The Kerberos ticket of the Embedded Hive Metastore needs to be renewed periodically. You can automatically run the kinit -k command by setting up a cron job.
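One way to schedule the renewal is a crontab entry inside the metastore container; this is only a sketch, reusing the same placeholder keytab path and principal as the start command above, with an illustrative eight-hour interval:

```shell
# Hypothetical crontab entry: renew the Kerberos ticket every 8 hours
# using the same keytab and principal as the container start command.
0 */8 * * * kinit -k -t /opt/secrets/xxxx.keytab xxxx@YYYY
```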

Embedded Database

The Embedded Hive Metastore stores the metadata in an Embedded PostgreSQL.

The metastore section of the values.yaml configures the connection to the Embedded PostgreSQL.

metastore:
  enabled: true
  connectionUrl: "jdbc:postgresql://postgresql:5432/metastore"
  connectionDriverName: "org.postgresql.Driver"
  connectionDatabase: "metastore"
  connectionUser: "hive"
  connectionPassword: "hive"

postgresql:
  enabled: true

External Database

You can also use an external database (PostgreSQL, MySQL, SQL Server or Oracle) with the Embedded Hive Metastore. An externally managed database has the advantage of keeping the metadata outside the cluster lifecycle. In some cases it is the only option, for example when there are policies restricting the type of RDBMS that can be installed, or governing backups, maintenance, etc.

To configure an external database, fill in the metastore.connectionXXX parameters with the connection details. Make sure that the external database can be reached from the Denodo Lakehouse Accelerator cluster. And do not forget to disable the Embedded PostgreSQL with postgresql.enabled=false, so that the Embedded PostgreSQL is not deployed.

metastore:
  enabled: true
  connectionUrl: "jdbc:sqlserver://xxxx.database.windows.net:1433;..."
  connectionDriverName: "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  connectionDatabase: "metastore"
  connectionUser: "user@DOMAIN"
  connectionPassword: "mypassword"

postgresql:
  enabled: false

  • connectionUrl: JDBC connection string for the database of the Embedded Hive Metastore, which can be:

    • the Embedded PostgreSQL, jdbc:postgresql://postgresql:5432/metastore, the default one

    • an external PostgreSQL

    • an external MySQL

    • an external SQL Server

    • an external Oracle

  • connectionDriverName: JDBC Driver class name to connect to the database of the Embedded Hive Metastore, which can be:

    • org.postgresql.Driver for PostgreSQL, the default one

    • org.mariadb.jdbc.Driver for MySQL

    • com.microsoft.sqlserver.jdbc.SQLServerDriver for SQL Server

    • oracle.jdbc.OracleDriver for Oracle

The Hive Metastore heap size defaults to 2048 MB, but you can configure it in the values.yaml according to your needs.

metastore:
  maxHeapSize: 2048

In addition, hive-metastore/scripts includes an initialization script for each supported external database (PostgreSQL, MySQL, SQL Server or Oracle) that must be run on the external database before deploying the Denodo Lakehouse Accelerator.
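As a sketch, assuming a PostgreSQL external database, the script could be run with psql. The script file name, host and credentials below are hypothetical placeholders; check hive-metastore/scripts for the actual file names:

```shell
# Hypothetical example: run the metastore initialization script against
# an external PostgreSQL before deploying. File name and connection
# details are placeholders.
psql -h postgres.example.com -p 5432 -U hive -d metastore \
     -f hive-metastore/scripts/hive-schema-postgres.sql
```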

Supported Databases for Embedded Hive Metastore

Database        | Minimum supported version
--------------- | -------------------------
Postgres        | 9.1.13
MySQL           | 5.6.17
MS SQL Server   | 2008 R2
Oracle          | 11g

CPU and Memory Management in Kubernetes

Kubernetes uses resource requests and resource limits to efficiently schedule pods across the cluster nodes.

  • Resource Requests: This specifies the minimum amount of a resource (CPU or Memory) that a container requires to function correctly. The Kubernetes scheduler will only place a Denodo Lakehouse Accelerator pod on a node that can guarantee the availability of the requested resources.

  • Resource Limits: This specifies the maximum amount of a resource (CPU or Memory) that a container is allowed to consume. Limits prevent a single pod from consuming all available resources on a node.

    • CPU Limits: If a pod tries to use more CPU than its limit, Kubernetes will throttle its CPU usage.

    • Memory Limits: If a pod tries to use more memory than its limit, Kubernetes will terminate (kill) the pod to prevent it from impacting the node. This often results in an “Out-Of-Memory” (OOMKilled) error.

The CPU and Memory resource requests and limits for the Denodo Embedded Hive Metastore pod can be configured within the metastore section of the values.yaml file:

metastore:
  resources:
    limits:
      cpu: 1
      memory: 2Gi
    requests:
      cpu: 1
      memory: 2Gi

  • CPU units:

    • 1.0 represents one full CPU core (or vCPU in cloud environments).

    • 0.1 or 100m (100 millicores) represents one-tenth of a CPU core.

  • Memory units:

    • Gi (Gibibytes) is the standard Kubernetes unit for memory. 1Gi = 1024Mi.
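Putting the two unit systems together, a more conservative allocation could be expressed with millicores and mebibytes. The values below are purely illustrative, not a recommendation:

```yaml
metastore:
  resources:
    limits:
      cpu: 500m       # half a CPU core
      memory: 1536Mi  # 1.5 Gi
    requests:
      cpu: 250m       # a quarter of a CPU core
      memory: 1024Mi  # 1 Gi
```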

Notice that the resources section for metastore is commented out by default in the provided values.yaml. These settings are left to the Kubernetes cluster administrator, because the optimal CPU and Memory values depend heavily on the instance types of the Kubernetes nodes, the workload patterns of the Denodo Lakehouse Accelerator, etc.
