Embedded Hive Metastore¶
Denodo Embedded MPP ships with an Embedded Hive Metastore that acts as a metadata repository, mapping Parquet files, Iceberg tables and Delta Lake tables - stored in S3, ADLS, GCS or HDFS - to Hive, Iceberg and Delta Lake tables.
Three predefined catalogs access this Embedded Hive Metastore:

- the `hive` catalog reads and writes Hive tables of Parquet files from the Embedded Hive Metastore.
- the `delta` catalog reads Delta Lake tables from the Embedded Hive Metastore.
- the `iceberg` catalog reads and writes Iceberg tables from the Embedded Hive Metastore.
Note
The hive catalog is a restricted catalog. This means it is not listed on the From MPP Catalogs tab of the Embedded MPP data source.
Supported Operations by Format¶
| Operation | Hive | Iceberg | Delta |
|---|---|---|---|
| Read | Yes | Yes | Yes |
| Create/Insert | Yes (*) | No | No |
| Update | No | No | No |
| Delete | No | No | No |
(*) To support write operations, make sure that Hive type catalogs have the following property:
hive.non-managed-table-writes-enabled=true
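Where exactly this property is set depends on your deployment. As a sketch, assuming the catalog is defined by a properties file (the file contents other than the last line are assumptions, not taken from this page):

```properties
# Hypothetical hive-type catalog properties file.
# Only hive.non-managed-table-writes-enabled is documented on this page;
# the rest is illustrative.
connector.name=hive
# allow CREATE TABLE / INSERT on non-managed (external) tables
hive.non-managed-table-writes-enabled=true
```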
Kerberos¶
If the dataset is located in an HDFS with Kerberos, you need to configure Kerberos-related properties:
Place the keytab file in the hive-metastore/secrets folder.

Add the Hadoop properties related to Kerberos in hive-metastore/conf/core-site.xml. This is just an example; it may be necessary to add extra properties.

Kerberos configuration in core-site.xml¶

    <property>
        <name>hadoop.security.authorization</name>
        <value>true</value>
    </property>
    <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
    </property>
    <property>
        <name>hadoop.http.authentication.kerberos.keytab</name>
        <value>/opt/secrets/xxxx.keytab</value>
    </property>
    <property>
        <name>dfs.datanode.kerberos.principal</name>
        <value>hdfs/xxxxx@YYYYYYY</value>
    </property>
This way the Embedded Hive Metastore connects to HDFS as the Kerberos principal dfs.datanode.kerberos.principal, using the keytab hadoop.http.authentication.kerberos.keytab.

Place the krb5.conf in hive-metastore/conf.

Add the following volumeMount to the additionalVolumeMounts property for metastore in the values.yaml.

Kerberos volumeMount in the additionalVolumeMounts property for metastore in the values.yaml¶

    additionalVolumeMounts:
      - name: hive-metastore-vol
        mountPath: /etc/krb5.conf
        subPath: krb5.conf
Replace the command in templates/hive-metastore-template.yaml with:

Kerberos command in hive-metastore-template.yaml¶

    command: ['sh', '-c', "kinit -k -t /opt/secrets/xxxx.keytab xxxx@YYYY; /opt/run-hive-metastore.sh"]
Important
The Kerberos ticket of the Embedded Hive Metastore needs to be renewed periodically. You can automatically run the kinit -k command
by setting up a cron job.
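As a sketch, a crontab entry inside the metastore container could renew the ticket periodically; the 8-hour interval below is an assumption (match it to your ticket lifetime), and the keytab path and principal are the placeholders used in the examples above:

```
# Hypothetical crontab entry; interval, keytab path and principal
# must be adapted to your environment
0 */8 * * * kinit -k -t /opt/secrets/xxxx.keytab xxxx@YYYY
```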
Embedded Database¶
The Embedded Hive Metastore stores the metadata in an Embedded PostgreSQL.
The metastore section of the values.yaml configures the connection to the Embedded PostgreSQL.
    metastore:
      enabled: true
      connectionUrl: "jdbc:postgresql://postgresql:5432/metastore"
      connectionDriverName: "org.postgresql.Driver"
      connectionDatabase: "metastore"
      connectionUser: "hive"
      connectionPassword: "hive"
    postgresql:
      enabled: true
External Database¶
You can also choose an external database (PostgreSQL, MySQL, SQL Server or Oracle) to work with the Embedded Hive Metastore. An externally managed database has the advantage of keeping the metadata outside the cluster lifecycle, and in some cases it is the only option, e.g. when policies restrict the type of RDBMS that can be installed, or impose backup and maintenance requirements.
To configure an external database, fill in the metastore.connectionXXX parameters with the connection details.
Make sure that the external database can be accessed from the Embedded MPP cluster.
Also disable the Embedded PostgreSQL with postgresql.enabled=false, so that the Embedded PostgreSQL is not deployed.
    metastore:
      enabled: true
      connectionUrl: "jdbc:sqlserver://xxxx.database.windows.net:1433;..."
      connectionDriverName: "com.microsoft.sqlserver.jdbc.SQLServerDriver"
      connectionDatabase: "metastore"
      connectionUser: "user@DOMAIN"
      connectionPassword: "mypassword"
    postgresql:
      enabled: false
connectionUrl: JDBC connection string for the database of the Embedded Hive Metastore, which can be:

- the Embedded PostgreSQL, jdbc:postgresql://postgresql:5432/metastore, the default one
- an external PostgreSQL
- an external MySQL
- an external SQL Server
- an external Oracle
connectionDriverName: JDBC Driver class name to connect to the database of the Embedded Hive Metastore, which can be:

- org.postgresql.Driver for PostgreSQL, the default one
- org.mariadb.jdbc.Driver for MySQL
- com.microsoft.sqlserver.jdbc.SQLServerDriver for SQL Server
- oracle.jdbc.OracleDriver for Oracle
The Hive Metastore heap size is set to 2048MB, but it is possible to configure it via values.yaml according to your needs.
    metastore:
      maxHeapSize: 2048
In addition, initialization scripts for the external databases (PostgreSQL, MySQL, SQL Server and Oracle) are included in the hive-metastore/scripts folder. The script for your database must be run on the external database before deploying the Denodo Embedded MPP.
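For example, for an external PostgreSQL the script could be applied with psql; the host, user and script name below are placeholders (use the actual script shipped in hive-metastore/scripts):

```
# Hypothetical invocation; substitute the real host, user and script name
psql -h <postgres-host> -U <user> -d metastore -f hive-metastore/scripts/<postgres-init-script>.sql
```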
| Database | Minimum supported version |
|---|---|
| Postgres | 9.1.13 |
| MySQL | 5.6.17 |
| MS SQL Server | 2008 R2 |
| Oracle | 11g |
CPU and Memory management in Kubernetes¶
Kubernetes uses resource requests and resource limits to efficiently schedule pods across the cluster nodes.
Resource Requests: This specifies the minimum amount of a resource (CPU or Memory) that a container requires to function correctly. The Kubernetes scheduler will only place a Denodo Embedded MPP pod on a node that can guarantee the availability of the requested resources.
Resource Limits: This specifies the maximum amount of a resource (CPU or Memory) that a container is allowed to consume. Limits prevent a single pod from consuming all available resources on a node.
CPU Limits: If a pod tries to use more CPU than its limit, Kubernetes will throttle its CPU usage.
Memory Limits: If a pod tries to use more memory than its limit, Kubernetes will terminate (kill) the pod to prevent it from impacting the node. This often results in an “Out-Of-Memory” (OOMKilled) error.
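When a pod is terminated this way, the reason is recorded in the pod status. A sketch of how to check it (pod name and namespace are placeholders):

```
# The last termination state of an OOM-killed container
# shows Reason: OOMKilled in the describe output
kubectl describe pod <metastore-pod> -n <namespace>
```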
The CPU and Memory resource requests and limits for the Denodo Embedded Hive Metastore pod can be configured within the
metastore section of the values.yaml file:
    metastore:
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 1
          memory: 2Gi
CPU units:

- 1.0 represents one full CPU core (or vCPU in cloud environments).
- 0.1 or 100m (100 millicores) represents one-tenth of a CPU core.

Memory units:

- Gi (Gibibytes) is the standard Kubernetes unit for memory. 1Gi = 1024Mi.
Notice that the resources section for metastore is commented out by default in the provided values.yaml.
We leave these settings to the Kubernetes cluster administrator, as the optimal CPU and Memory values depend heavily on
the instance types of the Kubernetes nodes, the workload patterns of the Denodo Embedded MPP, etc.
