
Hive

The Denodo Embedded MPP is distributed with a predefined catalog named hive, connected to the Embedded Hive Metastore, for accessing Hive tables backed by Parquet files.

hive.properties
connector.name=hive-hadoop2

# Embedded Hive Metastore
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore-timeout=20s

hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml

hive.parquet.use-column-names=true
hive.skip-empty-files=true
hive.parquet-batch-read-optimization-enabled=true
hive.pushdown-filter-enabled=true

hive.allow-drop-table=true
hive.non-managed-table-writes-enabled=true

Note

hive is used only by Denodo to create tables as a result of the graphical exploration of Parquet datasets on the From object storage tab of the Embedded MPP data source.

Therefore, hive is a restricted catalog and is not listed on the From MPP Catalogs tab of the Embedded MPP data source.

Parquet

The Denodo Embedded MPP data source allows you to graphically explore Parquet datasets, including those using Hive-style partitioning, create tables in the Embedded MPP, and create base views in Denodo.

Explore Parquet files

Important

To graphically explore Parquet datasets you need the Denodo subscription bundle Enterprise Plus and update 8.0u20230301.

Features

The Denodo Embedded MPP data source provides the following features when working with Parquet files:

  • Graphically explore Parquet datasets, create tables in the MPP and base views in Denodo (From object storage tab of the Embedded MPP data source)

  • Create base views over existing Hive tables in an External Metastore (From MPP Catalogs tab of the Embedded MPP data source).

    To do this, create a new catalog in the Embedded MPP, because the predefined catalog named hive is restricted and cannot be accessed from the From MPP Catalogs tab.

  • Querying

  • Embedded MPP Acceleration

  • Bulk data load

  • Caching: full cache mode

  • Managing Remote Tables

Limitations

  • Inserts: it is not possible to insert data using bulk data load into views created from Parquet files using the From object storage or From MPP Catalogs tabs of the Embedded MPP data source.

    However, it is possible to add single rows by executing the SQL statement INSERT, or multiple rows using direct load.

    Note

    • Bulk Load: a method for efficiently moving large amounts of data into a database when the source view is not in the Embedded MPP. Denodo first generates Parquet files containing the data to be inserted, then uploads those files to the path (HDFS URI) configured in the Write settings section of the Embedded MPP data source.

    • Direct Load: a method for loading data directly into the database by executing an INSERT INTO table <SELECT QUERY> statement that is delegable to the Embedded MPP data source, that is, when the source and target views are both in the Embedded MPP. Both cases are sketched below.
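
    For illustration, a minimal sketch of a single-row INSERT and a direct load; the catalog, schema and table names are hypothetical:

    Single-row INSERT and direct load (illustrative names)
    -- Add a single row with a plain INSERT statement
    INSERT INTO hive.demo.sales VALUES (1, 'EMEA', 120.50);

    -- Direct load: source and target views are both in the Embedded MPP,
    -- so the whole INSERT INTO ... SELECT is delegated as one query
    INSERT INTO hive.demo.sales_summary
    SELECT region, SUM(amount)
    FROM hive.demo.sales
    GROUP BY region;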

Supported Operations by Metastore Type

| Operation         | Embedded Hive Metastore | External Hive Metastore | AWS Glue Data Catalog |
|-------------------|-------------------------|-------------------------|-----------------------|
| Read              | Yes                     | Yes                     | Yes                   |
| Create/Insert (*) | Yes                     | Yes                     | Yes                   |
| Update            | No                      | No                      | No                    |
| Delete            | No                      | No                      | No                    |

(*) To support write operations, make sure that Hive-type catalogs have the following property:

hive.non-managed-table-writes-enabled=true
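
As an illustration of the matrix above (catalog, schema and table names are hypothetical), reads and creates/inserts are accepted, while updates and deletes are rejected:

Supported and unsupported operations (illustrative names)
-- Read: supported on all metastore types
SELECT * FROM hive.demo.customers;

-- Create/Insert: supported once writes are enabled as described above
CREATE TABLE hive.demo.customers_copy AS SELECT * FROM hive.demo.customers;

-- Update and Delete: not supported, these statements fail
-- UPDATE hive.demo.customers SET name = 'X' WHERE id = 1;
-- DELETE FROM hive.demo.customers WHERE id = 1;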

Kerberos

If the Parquet files are located in an HDFS cluster secured with Kerberos, you need to configure the Kerberos-related properties in the hive catalog, adding them to the additionalConfig property in values.yaml:

HDFS Kerberos configuration in values.yaml
presto:
  hive:
    additionalConfig: [
      hive.hdfs.authentication.type=KERBEROS,
      hive.hdfs.presto.principal=xxxx@REALM,
      hive.hdfs.presto.keytab=/opt/secrets/xxx.keytab
    ]

This way, the Embedded MPP will connect to HDFS as the Kerberos principal specified in hive.hdfs.presto.principal, using the keytab specified in hive.hdfs.presto.keytab.

You also need to place the keytab file in the presto/secrets folder and the krb5.conf file in the presto/conf/catalog/ folder, and add the following property to values.yaml:

krb5.conf configuration
presto:
  jvm:
    additionalJVMConfig: [
      -Djava.security.krb5.conf=/opt/presto-server/etc/catalog/krb5.conf
    ]

If hadoop.rpc.protection=privacy is required by the Hadoop cluster, one more property must be added to the catalog configuration:

Enable HDFS wire encryption in values.yaml
hive.hdfs.wire-encryption.enabled=true

You can find more information in Hive Security Configuration — Kerberos Support.

Avro, CSV, ORC

The Denodo Embedded MPP data source does not allow you to graphically explore datasets in file formats other than Parquet, such as Avro, CSV or ORC. However, you can work with them manually.

To do so, create the Hive tables in the Denodo Embedded MPP using a JDBC client, and then create the Denodo base views over those Hive tables using the From MPP Catalogs tab of the Embedded MPP data source.

The steps to follow are:

  1. Create a new catalog in the Embedded MPP, because the predefined catalog named hive is not accessible from the From MPP Catalogs tab of the Embedded MPP data source.

    The recommended way is to use the presto.catalog property in values.yaml. This new catalog will be accessed from the From MPP Catalogs tab.

    Below is an example of the properties for a Hive catalog, hive-orc, connected to the Embedded Hive Metastore to read Avro, CSV or ORC datasets:

      # -- Additional catalogs
      catalog:
        hive-orc: |-
          connector.name=hive-hadoop2
          hive.metastore.uri=thrift://hive-metastore:9083
          hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml
          hive.allow-drop-table=true
          hive.non-managed-table-writes-enabled=true
    
  2. Create the Hive table in the Embedded MPP with a JDBC client. For simple SQL queries like this you can use the Presto UI, as shown below. Note that this SQL query editor returns a maximum of 100 records per query.

    Create Hive table from ORC datasets
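
    A minimal sketch of such a CREATE TABLE statement, where the schema, table name, columns and location are hypothetical and must be adapted to your ORC datasets:

    Create an external Hive table over ORC files (illustrative)
    CREATE TABLE "hive-orc".default.orders (
        order_id BIGINT,
        customer VARCHAR,
        amount DOUBLE
    )
    WITH (
        format = 'ORC',
        external_location = 'hdfs://namenode:8020/data/orders/'
    );

    The double quotes around hive-orc are required because the catalog name contains a hyphen.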

  3. Create the Denodo base view on the previously created Hive table using the From MPP Catalogs tab of the Embedded MPP data source.
