Hive¶
The Denodo Lakehouse Accelerator provides robust capabilities for interacting with data stored in Hive tables, particularly in Parquet format.
The Denodo Lakehouse Accelerator is distributed with a predefined catalog named hive connected to the Embedded Hive Metastore, enabling access to Hive tables from Parquet files.
Here are the default properties for the hive catalog:
connector.name=hive-hadoop2
# connection to the Embedded Hive Metastore
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore-timeout=20s
hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml
hive.copy-on-first-write-configuration-enabled=false
hive.parquet-batch-read-optimization-enabled=true
hive.parquet.use-column-names=true
hive.pushdown-filter-enabled=true
hive.quick-stats.enabled=true
hive.skip-empty-files=true
hive.allow-drop-table=true
hive.non-managed-table-writes-enabled=true
Note
The hive catalog is a restricted catalog. This means it is not listed on the External Catalogs tab of the Denodo Lakehouse Accelerator data source.
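Once tables exist in this catalog, you can query them through the Denodo Lakehouse Accelerator with the usual catalog.schema.table notation. A minimal sketch, with hypothetical schema and table names:
-- Query a Parquet-backed Hive table through the predefined hive catalog
SELECT *
FROM hive.default.sales_parquet
LIMIT 10;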
Parquet¶
To query Parquet datasets using Denodo, you have to create the views and tables using the Denodo Lakehouse Accelerator data source. This data source provides a graphical interface to explore Parquet files, including those using Hive-style partitioning, create the tables in the Denodo Lakehouse Accelerator, and then create the base views in Denodo.
To graphically explore and create views from Parquet datasets stored in Object Storage, follow these steps:
1. In Denodo, open the Denodo Lakehouse Accelerator data source.
2. Connect it to your Object Storage (e.g., Amazon S3, Azure Data Lake Storage) in the Read Write tab. For detailed steps, refer to the Object Storage Data in Open Table Formats (Parquet, Iceberg and Delta Lake) section.
3. Go to the From Object Storage Routes tab. You will then be able to explore the Parquet files stored in your Object Storage.
4. Select the desired datasets to create the tables in the Denodo Lakehouse Accelerator and the corresponding base views in Denodo.
Note
You can use the DISCOVER_OBJECT_STORAGE_MPP_PROCEDURE stored procedure to automate this process.
Explore Parquet files¶
Important
To explore Parquet datasets graphically, you need the Denodo subscription bundle Enterprise Plus.
Features¶
The Denodo Lakehouse Accelerator data source provides the following features when interacting with Hive tables from Parquet files:
- Graphical exploration and view creation: Explore Parquet datasets, create tables in the Denodo Lakehouse Accelerator and base views in Denodo via the From Object Storage Routes tab of the Denodo Lakehouse Accelerator data source.
- External Metastore Integration: Create base views over existing Hive tables managed by an External Metastore (e.g., Hive Metastore) via the External Catalogs tab of the Denodo Lakehouse Accelerator data source. Note: to achieve this, you have to create a new catalog in the Denodo Lakehouse Accelerator that connects to the External Metastore, because the predefined hive catalog is restricted and cannot be accessed from the External Catalogs tab.
- Querying: Run queries over the Hive tables through the base views created in Denodo.
Limitations¶
When working with Hive tables and Parquet files through the Denodo Lakehouse Accelerator, be aware of the following limitations:
- Insert limitations: It is not possible to insert data using bulk load into views created from Parquet files using either the From Object Storage Routes or External Catalogs tabs of the Denodo Lakehouse Accelerator data source.
- Allowed insert strategies: You can add single rows by executing the INSERT statement, or multiple rows using the direct load strategy.
Note
Understanding load strategies:
- Bulk Load: This strategy efficiently moves large amounts of data when the source view is not in the Denodo Lakehouse Accelerator. It generates temporary Parquet files and then uploads those files to the HDFS URI path configured in the Write settings section of the Denodo Lakehouse Accelerator data source.
- Direct Load: This strategy loads data directly into the database by executing an INSERT INTO table <SELECT QUERY> statement. Denodo will automatically choose this strategy over bulk load when possible. This strategy is possible as long as Denodo can push the entire <SELECT QUERY> to the Denodo Lakehouse Accelerator; that is, the query is delegable to the Denodo Lakehouse Accelerator data source, and both the source and target views are in the Denodo Lakehouse Accelerator, as illustrated in the sketch below.
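For example, assuming a Denodo base view sales backed by a Parquet table in the Denodo Lakehouse Accelerator and a source view staging_sales that also lives in the Accelerator (both names are hypothetical), the two allowed strategies look like this:
-- Single-row insert: always allowed
INSERT INTO sales (id, amount) VALUES (1, 99.50);
-- Direct load: the whole SELECT is delegated to the Accelerator,
-- so both the source and target views must live there
INSERT INTO sales
SELECT id, amount
FROM staging_sales;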
Supported Operations by Metastore Type¶
The following table summarizes the Hive operations currently supported by the Denodo Lakehouse Accelerator for various Metastore types:
| Operation | Hive Metastore | AWS Glue Data Catalog |
|---|---|---|
| Read | Yes | Yes |
| Create/Insert (*) | Yes | Yes |
| Update | No | No |
| Merge | No | No |
| Delete | No | No |
(*) To support write operations, ensure that the Denodo Lakehouse Accelerator Hive catalog has the following property enabled:
hive.non-managed-table-writes-enabled=true
Be aware of the insert limitations that apply to Hive tables within the Denodo Lakehouse Accelerator.
Kerberos for HDFS¶
If the Parquet files are located in an HDFS cluster secured with Kerberos, you need to configure Kerberos-related properties in the hive catalog
in values.yaml.
Configure Kerberos properties in values.yaml. Add the following properties to the presto.hive.additionalConfig section:
presto:
  hive:
    additionalConfig: [
      hive.hdfs.authentication.type=KERBEROS,
      hive.hdfs.presto.principal=xxxx@REALM,           # Replace xxxx@REALM with your principal
      hive.hdfs.presto.keytab=/opt/secrets/xxx.keytab  # Replace xxx.keytab with the name of your keytab file
    ]
This configuration ensures that the Denodo Lakehouse Accelerator connects to HDFS using the specified Kerberos principal, hive.hdfs.presto.principal, and its corresponding keytab file, hive.hdfs.presto.keytab.
Kerberos Files:
- Place the keytab file in the presto/secrets folder within your Denodo Lakehouse Accelerator Helm chart.
- Place the krb5.conf file in the presto/conf/catalog/ folder.
- Add the following property to the values.yaml to inform Java about the krb5.conf location:
presto:
  jvm:
    additionalJVMConfig: [
      -Djava.security.krb5.conf=/opt/presto-server/etc/catalog/krb5.conf
    ]
Enable HDFS Wire Encryption (if required):
If the HDFS cluster has HDFS wire encryption enabled, you must add one more property to the presto.hive.additionalConfig section in values.yaml:
hive.hdfs.wire-encryption.enabled=true
For more detailed information on Hive Security Configuration and Kerberos support, you can refer to Hive Security Configuration — Kerberos Support.
Other File Formats: Avro, CSV, ORC¶
The Denodo Lakehouse Accelerator data source currently does not allow you to graphically explore
datasets in file formats other than Parquet, such as Avro, CSV or ORC, directly from the From Object Storage Routes tab.
However, you can still work with these formats through a manual process.
You will need to manually create the Hive tables in the Denodo Lakehouse Accelerator using a JDBC client, and then create the Denodo base views
over those Hive tables using the External Catalogs tab of the Denodo Lakehouse Accelerator data source.
Follow these steps:
Create a new catalog in the Denodo Lakehouse Accelerator:
You must create a new catalog in the Denodo Lakehouse Accelerator specifically for accessing these file formats, as the predefined hive catalog is not accessible from the External Catalogs tab. The recommended way to do this is by using the presto.catalog property in the values.yaml. This new catalog will then be accessible from the External Catalogs tab.
Below is an example of a Hive catalog definition, hive-orc, connected to the Embedded Hive Metastore and configured to read Avro, CSV or ORC datasets:
# -- Additional catalogs
catalog:
  #hive-orc: |-
  #  connector.name=hive-hadoop2
  #  hive.metastore.uri=thrift://hive-metastore:9083
  #  hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml
  #  hive.allow-drop-table=true
  #  hive.non-managed-table-writes-enabled=true
Create the Hive Table in the Denodo Lakehouse Accelerator using a JDBC Client:
Connect to the Denodo Lakehouse Accelerator using a JDBC client (such as DBeaver, or for simple SQL queries, you can use the Presto UI):
Create Hive table from ORC datasets¶
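A minimal sketch of such a statement, assuming an ORC dataset stored in S3; the schema, table, columns and bucket path are hypothetical:
-- Register an existing ORC dataset as an external Hive table
-- (the hive-orc catalog name must be quoted because of the hyphen)
CREATE TABLE "hive-orc".default.orders_orc (
    order_id bigint,
    amount   double
)
WITH (
    external_location = 's3a://my-bucket/orders-orc/',
    format = 'ORC'
);
Once this statement has been executed through the JDBC client, the orders_orc table becomes visible in the hive-orc catalog.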
Create the Denodo Base View:
Once the Hive tables are created in the Denodo Lakehouse Accelerator via your new catalog, you can then create the Denodo base views over these tables by navigating to the External Catalogs tab of the Denodo Lakehouse Accelerator data source in Denodo. Select your newly created catalog and the desired tables to generate the base views.
