USER MANUALS

Hive

The Denodo Embedded MPP provides robust capabilities for interacting with data stored in Hive tables, particularly in Parquet format.

The Denodo Embedded MPP is distributed with a predefined catalog named hive connected to the Embedded Hive Metastore, enabling access to Hive tables from Parquet files.

Here are the default properties for the hive catalog:

connector.name=hive-hadoop2

# connection to the Embedded Hive Metastore
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore-timeout=20s

hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml

hive.copy-on-first-write-configuration-enabled=false
hive.parquet-batch-read-optimization-enabled=true
hive.parquet.use-column-names=true
hive.pushdown-filter-enabled=true
hive.quick-stats.enabled=true
hive.skip-empty-files=true

hive.allow-drop-table=true
hive.non-managed-table-writes-enabled=true

Note

The hive catalog is a restricted catalog. This means it is not listed on the From MPP Catalogs tab of the Embedded MPP data source.

Parquet

To query Parquet datasets using Denodo, create the tables and views using the Embedded MPP data source. This data source provides a graphical interface to explore Parquet files, including those using Hive-style partitioning, to create the tables in the Embedded MPP, and then to create the base views in Denodo.

To graphically explore and create views from Parquet datasets stored in Object Storage, follow these steps:

  1. In Denodo, open the Embedded MPP data source.

  2. Connect it to your Object Storage in the Read Write tab (e.g., Amazon S3). For detailed steps, refer to the Parquet format using the embedded MPP section.

  3. Go to the From object storage tab.

  4. You will then be able to explore the Parquet files stored in your Object Storage.

  5. Select the desired datasets to create the tables in the Denodo Embedded MPP and create corresponding base views in Denodo.

Note

You can use the DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE stored procedure to automate this process.
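As an illustrative sketch of automating this from a VQL shell (the argument shown is an assumption for illustration only, not the procedure's documented signature; consult the Denodo stored procedure reference for the actual parameters):

```sql
-- Hypothetical invocation; the single route argument is an assumed
-- parameter used only to illustrate the shape of the call.
CALL DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE(
    's3://my-bucket/sales/'   -- object storage path to scan (assumed)
);
```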

Explore Parquet files

Important

To explore Parquet datasets graphically, you need the Denodo subscription bundle Enterprise Plus and update 8.0u20230301.

Features

The Denodo Embedded MPP data source provides the following features when interacting with Hive tables from Parquet files:

  • Graphical exploration and view creation: Explore Parquet datasets, create tables in the MPP and base views in Denodo via the From object storage tab of the Embedded MPP data source.

  • External Metastore Integration: Create base views over existing Hive tables managed by an External Metastore (e.g., Hive Metastore) via the From MPP Catalogs tab of the Embedded MPP data source.

    Note: To achieve this, you have to create a new catalog in the Embedded MPP that connects to the External Metastore. This is because the predefined hive catalog is restricted and cannot be accessed from the From MPP Catalogs tab.

  • Querying

  • Embedded MPP Acceleration

  • Bulk data load

  • Caching: full cache mode

  • Managing Remote Tables
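The external-metastore catalog mentioned in the note above can be defined through the presto.catalog property in values.yaml. A minimal sketch, assuming a hypothetical catalog name (hive-external) and Metastore endpoint:

```yaml
presto:
  catalog:
    # Illustrative catalog pointing at an external Hive Metastore;
    # replace the thrift URI with your Metastore endpoint.
    hive-external: |-
      connector.name=hive-hadoop2
      hive.metastore.uri=thrift://external-metastore.example.com:9083
```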

Limitations

When working with Hive tables and Parquet files through the Denodo Embedded MPP, be aware of the following limitations:

  • Insert limitations: it is not possible to insert data using bulk load into views created from Parquet files using either the From Object Storage or From MPP Catalogs tabs of the Embedded MPP data source.

    • Allowed insert strategies: You can add single rows by executing the INSERT statement, or multiple rows using the direct load strategy.

    Note

    Understanding load strategies:

    • Bulk Load: This strategy efficiently moves large amounts of data when the source view is not in the Embedded MPP.

      It generates temporary Parquet files and then uploads those files to the HDFS URI path configured in the Write settings section of the Embedded MPP data source.

    • Direct Load: This strategy loads data directly into the database by executing an INSERT INTO table <SELECT QUERY> statement. Denodo will automatically choose this strategy over bulk load when possible. This strategy is possible as long as Denodo can push the entire <SELECT QUERY> to the Embedded MPP. That is, the query is delegable to the Embedded MPP data source, and both the source and target views are in the Embedded MPP.
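For instance, a direct load that the Embedded MPP can execute entirely on its own is shaped like the following statement (catalog, schema, and table names are illustrative):

```sql
-- Illustrative direct load: the whole SELECT is delegated to the
-- Embedded MPP, so data moves inside the MPP without temporary
-- Parquet files being generated and uploaded.
INSERT INTO hive.analytics.sales_summary
SELECT region, sum(amount) AS total_amount
FROM hive.analytics.sales
GROUP BY region;
```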

Supported Operations by Metastore Type

The following table summarizes the Hive operations currently supported by the Denodo Embedded MPP for various Metastore types:

Operation           Hive Metastore   AWS Glue Data Catalog
Read                Yes              Yes
Create/Insert (*)   Yes              Yes
Update              No               No
Delete              No               No

(*) To support write operations, ensure that the Embedded MPP Hive catalog has the following property enabled:

hive.non-managed-table-writes-enabled=true

Be aware of the insert limitations that apply to Hive tables within the Embedded MPP.

Kerberos for HDFS

If the Parquet files are located in an HDFS cluster secured with Kerberos, you need to configure Kerberos-related properties in the hive catalog in values.yaml.

  1. Configure Kerberos properties in values.yaml. Add the following properties to the presto.hive.additionalConfig section:

    presto:
      hive:
        additionalConfig: [
          hive.hdfs.authentication.type=KERBEROS,
          hive.hdfs.presto.principal=xxxx@REALM, # Replace xxxx@REALM with your principal
          hive.hdfs.presto.keytab=/opt/secrets/xxx.keytab # Replace xxx.keytab with the name of your keytab file
        ]
    

    This configuration ensures that the Embedded MPP connects to HDFS using the specified Kerberos principal, hive.hdfs.presto.principal, and its corresponding keytab file, hive.hdfs.presto.keytab.

  2. Kerberos Files:

    • Place the keytab file in the presto/secrets folder within your Denodo Embedded MPP Helm chart.

    • Place the krb5.conf file in the presto/conf/catalog/ folder.

    • Add the following property to the values.yaml to inform Java about the krb5.conf location:

    presto:
      jvm:
        additionalJVMConfig: [
          -Djava.security.krb5.conf=/opt/presto-server/etc/catalog/krb5.conf
        ]
    
  3. Enable HDFS Wire Encryption (if required):

    If the HDFS cluster has HDFS wire encryption enabled, you must add one more property to the presto.hive.additionalConfig section in values.yaml:

       hive.hdfs.wire-encryption.enabled=true
    

For more detailed information on Hive Security Configuration and Kerberos support, you can refer to Hive Security Configuration — Kerberos Support.

Other File Formats: Avro, CSV, ORC

The Denodo Embedded MPP data source currently does not allow you to graphically explore datasets in file formats other than Parquet, such as Avro, CSV or ORC, directly from the From object storage tab. However, you can still work with these formats through a manual process.

You will need to manually create the Hive tables in the Denodo Embedded MPP using a JDBC client, and then create the Denodo base views over those Hive tables using the From MPP Catalogs tab of the Embedded MPP data source.

Follow these steps:

  1. Create a new catalog in the Embedded MPP:

    You must create a new catalog in the Embedded MPP specifically for accessing these file formats, as the predefined hive catalog is not accessible from the From MPP Catalogs tab. The recommended way to do this is by using the presto.catalog property in values.yaml. This new catalog will then be accessible from the From MPP Catalogs tab.

    Below is an example of a Hive catalog definition, hive-orc, connected to the Embedded Hive Metastore, configured to read Avro, CSV or ORC datasets:

      # -- Additional catalogs
      catalog:
        hive-orc: |-
          connector.name=hive-hadoop2
          hive.metastore.uri=thrift://hive-metastore:9083
          hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml
          hive.allow-drop-table=true
          hive.non-managed-table-writes-enabled=true
    
  2. Create the Hive Table in the Embedded MPP using a JDBC Client:

    Connect to the Denodo Embedded MPP using a JDBC client, such as DBeaver (or, for simple SQL queries, the Presto UI), and create the Hive table:

    Create Hive table from ORC datasets
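For instance, a statement creating an external Hive table over existing ORC files might look like this (the schema, column names, and location are illustrative assumptions; the column definitions must match the actual files):

```sql
-- Illustrative: external table in the hive-orc catalog over existing
-- ORC files; external_location points at the dataset folder.
CREATE TABLE "hive-orc".default.orders (
  order_id  bigint,
  customer  varchar,
  amount    double
)
WITH (
  format = 'ORC',
  external_location = 's3a://my-bucket/orders/'
);
```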

  3. Create the Denodo Base View:

    Once the Hive tables are created in the Embedded MPP via your new catalog, you can then create the Denodo base views over these tables by navigating to the From MPP Catalogs tab of the Embedded MPP data source in Denodo. Select your newly created catalog and the desired tables to generate the base views.
