Embedded Hive Metastore¶
Denodo Embedded MPP is shipped with an Embedded Hive Metastore that acts as a metadata repository, mapping Parquet files, Iceberg tables or Delta Lake tables - stored in S3, ADLS, GCS or HDFS - to Hive, Iceberg or Delta Lake tables.
Three predefined catalogs access this Embedded Hive Metastore:

the hive catalog reads and writes Hive tables of Parquet files from the Embedded Hive Metastore.
the delta catalog reads Delta Lake tables from the Embedded Hive Metastore.
the iceberg catalog reads and writes Iceberg tables from the Embedded Hive Metastore.
Note

hive, delta and iceberg are used only by Denodo to create tables as a result of the graphical exploration of datasets on the From object storage tab of the Embedded MPP data source.
Therefore hive, delta and iceberg are restricted catalogs, so they are not listed on the From MPP Catalogs tab of the Embedded MPP data source.
Supported Operations by Format¶
Operation | Hive | Iceberg | Delta
---|---|---|---
Read | Yes | Yes | Yes
Create/Insert | Yes (*) | Yes (**) | No
Update | No | No | No
Delete | No | Yes (***) | No
(*) To support write operations, make sure that Hive-type catalogs have the following property:
hive.non-managed-table-writes-enabled=true
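As an illustration, the property sits alongside the rest of the Hive catalog definition; the snippet below is only a sketch, and the exact file name, location and connector name depend on how the catalogs are defined in your deployment:

# Sketch of a Hive catalog properties file; file name, location and
# connector name depend on your deployment
connector.name=hive
hive.non-managed-table-writes-enabled=true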
(**) To support write operations in Iceberg catalogs other than iceberg, make sure that VDP is configured as follows:
SET 'com.denodo.util.jdbc.inspector.impl.PrestoJDBCInspector.iceberg.catalogNames' = 'iceberg, other_iceberg, another_iceberg';
(***) The Delete operation is supported starting with Denodo 9.2.0. For V1 tables, the Iceberg connector can only delete data in one or more entire partitions, and the columns in the filter must all be identity-transformed partition columns of the target table.
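For example, a partition-level delete on a V1 table could look like the following; the catalog, schema, table and column names are illustrative only:

-- Deletes one entire partition; the filter column must be an
-- identity-transformed partition column of the target table
DELETE FROM iceberg.sales.orders WHERE order_date = DATE '2024-01-01';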
Kerberos¶
If the dataset is located in an HDFS cluster secured with Kerberos, you need to configure the Kerberos-related properties:
Place the keytab file in the hive-metastore/secrets folder.

Add the Hadoop properties related to Kerberos in the hive-metastore/conf/core-site.xml. This is just an example; it may be necessary to add extra properties.

Kerberos configuration in core-site.xml¶

<property>
    <name>hadoop.security.authorization</name>
    <value>true</value>
</property>
<property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
</property>
<property>
    <name>hadoop.http.authentication.kerberos.keytab</name>
    <value>/opt/secrets/xxxx.keytab</value>
</property>
<property>
    <name>dfs.datanode.kerberos.principal</name>
    <value>hdfs/xxxxx@YYYYYYY</value>
</property>
This way the Embedded Hive Metastore connects to HDFS as the Kerberos principal dfs.datanode.kerberos.principal, using the keytab hadoop.http.authentication.kerberos.keytab.

Place the krb5.conf in hive-metastore/conf.
Add the following volumeMount to the additionalVolumeMounts property for metastore in the values.yaml.

Kerberos volumeMount in the additionalVolumeMounts property for metastore in the values.yaml¶

additionalVolumeMounts:
  - name: hive-metastore-vol
    mountPath: /etc/krb5.conf
    subPath: krb5.conf
Replace the command in templates/hive-metastore-template.yaml with:

Kerberos command in hive-metastore-template.yaml¶

command: ['sh', '-c', "kinit -k -t /opt/secrets/xxxx.keytab xxxx@YYYY; /opt/run-hive-metastore.sh"]
Important

The Kerberos ticket of the Embedded Hive Metastore needs to be renewed periodically. You can run the kinit -k command automatically by setting up a cron job.
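A minimal sketch of such a cron job follows; the renewal interval, keytab path and principal are examples and should be adapted to your environment:

# Renew the Kerberos ticket of the Embedded Hive Metastore every 8 hours
0 */8 * * * kinit -k -t /opt/secrets/xxxx.keytab xxxx@YYYY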
Embedded Database¶
The Embedded Hive Metastore stores the metadata in an Embedded PostgreSQL.
The metastore section of the values.yaml configures the connection to the Embedded PostgreSQL.
metastore:
  enabled: true
  connectionUrl: "jdbc:postgresql://postgresql:5432/metastore"
  connectionDriverName: "org.postgresql.Driver"
  connectionDatabase: "metastore"
  connectionUser: "hive"
  connectionPassword: "hive"

postgresql:
  enabled: true
External Database¶
You can also choose to use an alternative external database (PostgreSQL, MySQL, SQL Server or Oracle) to work with the Embedded Hive Metastore. The externally-managed database option has the advantage of keeping the metadata outside the cluster lifecycle. In some cases it is the only option, for example when there are policies restricting the type of RDBMS that can be installed, or requirements around backups, maintenance, etc.
To configure an external database, fill in the metastore.connectionXXX parameters with the connection details.
Make sure that the external database can be accessed from the Embedded MPP cluster.
And do not forget to disable the Embedded PostgreSQL with postgresql.enabled=false, so that it is not deployed.
metastore:
  enabled: true
  connectionUrl: "jdbc:sqlserver://xxxx.database.windows.net:1433;..."
  connectionDriverName: "com.microsoft.sqlserver.jdbc.SQLServerDriver"
  connectionDatabase: "metastore"
  connectionUser: "user@DOMAIN"
  connectionPassword: "mypassword"

postgresql:
  enabled: false
connectionUrl: JDBC connection string for the database of the Embedded Hive Metastore, which can be:

the Embedded PostgreSQL, jdbc:postgresql://postgresql:5432/metastore, the default one
an external PostgreSQL
an external MySQL
an external SQL Server
an external Oracle
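For reference, connection strings for the external options typically follow these patterns; the host names, ports and database names below are placeholders:

jdbc:postgresql://mypostgres.example.com:5432/metastore for PostgreSQL
jdbc:mysql://mymysql.example.com:3306/metastore (or jdbc:mariadb://..., depending on the MariaDB driver version) for MySQL
jdbc:sqlserver://mysqlserver.example.com:1433;databaseName=metastore for SQL Server
jdbc:oracle:thin:@//myoracle.example.com:1521/service_name for Oracle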
connectionDriverName: JDBC Driver class name to connect to the database of the Embedded Hive Metastore, which can be:

org.postgresql.Driver for PostgreSQL, the default one
org.mariadb.jdbc.Driver for MySQL
com.microsoft.sqlserver.jdbc.SQLServerDriver for SQL Server
oracle.jdbc.OracleDriver for Oracle
The Hive Metastore heap size is set to 2048MB, but it is possible to configure it via values.yaml
according to your needs.
metastore:
  maxHeapSize: 2048
In addition, there is an initialization script for each of the external databases (PostgreSQL, MySQL, SQL Server or Oracle), included in the hive-metastore/scripts folder, that must be run on the external database before deploying the Denodo Embedded MPP.
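For example, for an external PostgreSQL the script can be run with psql before deploying; the script name below is hypothetical, use the one shipped in the hive-metastore/scripts folder:

# Run the initialization script on the external metastore database (script name is illustrative)
psql -h mypostgres.example.com -U hive -d metastore -f hive-metastore/scripts/hive-schema-postgres.sql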
Database | Minimum supported version
---|---
Postgres | 9.1.13
MySQL | 5.6.17
MS SQL Server | 2008 R2
Oracle | 11g
CPU and Memory Management in Kubernetes¶
Kubernetes schedules pods across nodes based on the resource requests and limits for CPU and memory. If a pod requests certain CPU and/or memory values, Kubernetes will only schedule it on a node that can guarantee those resources. Limits, on the other hand, ensure that a container never exceeds the specified value. These values can be set for the metastore in the values.yaml:
metastore:
  resources:
    limits:
      cpu: 1
      memory: 1Gi
    requests:
      cpu: 1
      memory: 1Gi
Note that resources are commented out, as we leave this setting as a choice for the Kubernetes cluster administrator.