Hive

The Denodo Lakehouse Accelerator provides robust capabilities for interacting with data stored in Hive tables, particularly in Parquet format.

The Denodo Lakehouse Accelerator is distributed with a predefined catalog named hive connected to the Embedded Hive Metastore, enabling access to Hive tables from Parquet files.

Here are the default properties for the hive catalog:

connector.name=hive-hadoop2

# connection to the Embedded Hive Metastore
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore-timeout=20s

hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml

hive.copy-on-first-write-configuration-enabled=false
hive.parquet-batch-read-optimization-enabled=true
hive.parquet.use-column-names=true
hive.pushdown-filter-enabled=true
hive.quick-stats.enabled=true
hive.skip-empty-files=true

hive.allow-drop-table=true
hive.non-managed-table-writes-enabled=true

Note

The hive catalog is a restricted catalog. This means it is not listed on the External Catalogs tab of the Denodo Lakehouse Accelerator data source.

Parquet

To query Parquet datasets using Denodo, create the tables and views using the Denodo Lakehouse Accelerator data source. This data source provides a graphical interface to explore Parquet files (including those using Hive-style partitioning), create the tables in the Denodo Lakehouse Accelerator, and then create the base views in Denodo.

To graphically explore Parquet datasets stored in Object Storage and create views from them, follow these steps:

  1. In Denodo, open the Denodo Lakehouse Accelerator data source.

  2. Connect it to your Object Storage in the Read Write tab (e.g., Amazon S3, Azure Data Lake Storage). For detailed steps, refer to the Object Storage Data in Open Table Formats (Parquet, Iceberg and Delta Lake) section.

  3. Go to the From Object Storage Routes tab.

  4. You will then be able to explore the Parquet files stored in your Object Storage.

  5. Select the desired datasets to create the tables in the Denodo Lakehouse Accelerator and create corresponding base views in Denodo.

Note

You can use the DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE stored procedure to automate this process.

Explore Parquet files

Important

To explore Parquet datasets graphically, you need the Denodo subscription bundle Enterprise Plus.

Features

The Denodo Lakehouse Accelerator data source provides the following features when interacting with Hive tables from Parquet files:

  • Graphical exploration and view creation: Explore Parquet datasets, create tables in the Denodo Lakehouse Accelerator and base views in Denodo via the From Object Storage Routes tab of the Denodo Lakehouse Accelerator data source.

  • External Metastore Integration: Create base views over existing Hive tables managed by an External Metastore (e.g., Hive Metastore) via the External Catalogs tab of the Denodo Lakehouse Accelerator data source.

    Note: To achieve this, you have to create a new catalog in the Denodo Lakehouse Accelerator that connects to the External Metastore. This is because the predefined hive catalog is restricted and cannot be accessed from the External Catalogs tab.

  • Querying

  • Embedded MPP Acceleration

  • Bulk data load

  • Caching: full cache mode

  • Managing Remote Tables

Limitations

When working with Hive tables and Parquet files through the Denodo Lakehouse Accelerator, be aware of the following limitations:

  • Insert limitations: it is not possible to insert data using bulk load into views created from Parquet files using either the From Object Storage Routes or External Catalogs tabs of the Denodo Lakehouse Accelerator data source.

    • Allowed insert strategies: You can add single rows by executing the INSERT statement, or multiple rows using the direct load strategy.

    Note

    Understanding load strategies:

    • Bulk Load: This strategy efficiently moves large amounts of data when the source view is not in the Denodo Lakehouse Accelerator.

      It generates temporary Parquet files and then uploads those files to the HDFS URI path configured in the Write settings section of the Denodo Lakehouse Accelerator data source.

    • Direct Load: This strategy loads data directly into the database by executing an INSERT INTO table <SELECT QUERY> statement. Denodo will automatically choose this strategy over bulk load when possible. This strategy is possible as long as Denodo can push the entire <SELECT QUERY> to the Denodo Lakehouse Accelerator. That is, the query is delegable to the Denodo Lakehouse Accelerator data source, and both the source and target views are in the Denodo Lakehouse Accelerator.
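    For instance, a direct load could look like the statement below, which the Denodo Lakehouse Accelerator can execute entirely on its own. The catalog, schema, table, and column names are illustrative, not part of any predefined setup:

    ```sql
    -- Direct load: the whole statement is delegated to the Denodo
    -- Lakehouse Accelerator, since both the source table (sales) and
    -- the target table (sales_summary) live there.
    -- All object names here are illustrative.
    INSERT INTO hive.default.sales_summary
    SELECT region, sum(amount) AS total_amount
    FROM hive.default.sales
    GROUP BY region;
    ```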

Supported Operations by Metastore Type

The following table summarizes the Hive operations currently supported by the Denodo Lakehouse Accelerator for various Metastore types:

Operation          Hive Metastore   AWS Glue Data Catalog
-----------------  ---------------  ---------------------
Read               Yes              Yes
Create/Insert (*)  Yes              Yes
Update             No               No
Merge              No               No
Delete             No               No

(*) To support write operations, ensure that the Denodo Lakehouse Accelerator Hive catalog has the following property enabled:

hive.non-managed-table-writes-enabled=true

Be aware of the insert limitations that apply to Hive tables within the Denodo Lakehouse Accelerator.

Kerberos for HDFS

If the Parquet files are located in an HDFS cluster secured with Kerberos, you need to configure Kerberos-related properties in the hive catalog in values.yaml.

  1. Configure Kerberos properties in values.yaml. Add the following properties to the presto.hive.additionalConfig section:

    presto:
      hive:
        additionalConfig: [
          hive.hdfs.authentication.type=KERBEROS,
          hive.hdfs.presto.principal=xxxx@REALM, # Replace xxxx@REALM with your principal
          hive.hdfs.presto.keytab=/opt/secrets/xxx.keytab # Replace xxx.keytab with the name of your keytab file
        ]
    

    This configuration ensures that the Denodo Lakehouse Accelerator connects to HDFS using the specified Kerberos principal, hive.hdfs.presto.principal, and its corresponding keytab file, hive.hdfs.presto.keytab.

  2. Kerberos Files:

    • Place the keytab file in the presto/secrets folder within your Denodo Lakehouse Accelerator Helm chart.

    • Place the krb5.conf file in the presto/conf/catalog/ folder.

    • Add the following property to the values.yaml to inform Java about the krb5.conf location:

    presto:
      jvm:
        additionalJVMConfig: [
          -Djava.security.krb5.conf=/opt/presto-server/etc/catalog/krb5.conf
        ]
    
  3. Enable HDFS Wire Encryption (if required):

    If the HDFS cluster has HDFS wire encryption enabled, you must add one more property to the presto.hive.additionalConfig section in values.yaml:

       hive.hdfs.wire-encryption.enabled=true
    

For more detailed information on Hive Security Configuration and Kerberos support, you can refer to Hive Security Configuration — Kerberos Support.

Other File Formats: Avro, CSV, ORC

The Denodo Lakehouse Accelerator data source currently does not allow you to graphically explore datasets in file formats other than Parquet, such as Avro, CSV or ORC, directly from the From Object Storage Routes tab. However, you can still work with these formats through a manual process.

You will need to manually create the Hive tables in the Denodo Lakehouse Accelerator using a JDBC client, and then create the Denodo base views over those Hive tables using the External Catalogs tab of the Denodo Lakehouse Accelerator data source.

Follow these steps:

  1. Create a new catalog in the Denodo Lakehouse Accelerator:

    You must create a new catalog in the Denodo Lakehouse Accelerator specifically for accessing these file formats, as the predefined hive catalog is not accessible from the External Catalogs tab. The recommended way to do this is by using the presto.catalog property in values.yaml. This new catalog will then be accessible from the External Catalogs tab.

    Below is an example of a Hive catalog definition, hive-orc, connected to the Embedded Hive Metastore, configured to read Avro, CSV or ORC datasets:

      # -- Additional catalogs
      catalog:
        hive-orc: |-
          connector.name=hive-hadoop2
          hive.metastore.uri=thrift://hive-metastore:9083
          hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml
          hive.allow-drop-table=true
          hive.non-managed-table-writes-enabled=true
    
  2. Create the Hive Table in the Denodo Lakehouse Accelerator using a JDBC Client:

    Connect to the Denodo Lakehouse Accelerator using a JDBC client (such as DBeaver; for simple SQL queries, you can also use the Presto UI):

    Create Hive table from ORC datasets
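    As an example, a table over an existing ORC dataset can be registered with a CREATE TABLE statement like the one below. The schema, table, columns, and external_location path are illustrative; adjust them to your dataset (and use AVRO or CSV as the format where appropriate):

    ```sql
    -- Registers an external Hive table over ORC files already present
    -- in Object Storage. All names and the location are illustrative.
    CREATE TABLE "hive-orc".default.orders (
        order_id  bigint,
        customer  varchar,
        amount    double
    )
    WITH (
        format = 'ORC',
        external_location = 's3a://my-bucket/orders/'
    );
    ```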

  3. Create the Denodo Base View:

    Once the Hive tables are created in the Denodo Lakehouse Accelerator via your new catalog, you can then create the Denodo base views over these tables by navigating to the External Catalogs tab of the Denodo Lakehouse Accelerator data source in Denodo. Select your newly created catalog and the desired tables to generate the base views.
