Hive

The Denodo Embedded MPP is distributed with a predefined catalog named hive, connected to the Embedded Hive Metastore, for accessing Hive tables backed by Parquet files.

hive.properties
connector.name=hive-hadoop2

# Embedded Hive Metastore
hive.metastore.uri=thrift://hive-metastore:9083
hive.metastore-timeout=20s

# Hadoop configuration resources (core-site.xml)
hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml

# Match Parquet columns by name and skip empty files
hive.parquet.use-column-names=true
hive.skip-empty-files=true

# Allow dropping tables and writing to external (non-managed) tables
hive.allow-drop-table=true
hive.non-managed-table-writes-enabled=true

Note

Hive is a restricted catalog, so it is not listed on the From MPP Catalogs tab of the Embedded MPP data source.

Parquet

To query Parquet datasets, you have to create Hive tables in the Denodo Embedded MPP. For this, you can use the Embedded MPP data source in Denodo to graphically explore Parquet datasets (including those using Hive-style partitioning), create the tables in the MPP, and create the base views in Denodo. See Object Storage Data in Open Table Formats for more details on how to connect to an object storage graphically.
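
These Hive tables can also be created manually with SQL. Below is a minimal sketch, in which the schema, table, columns and s3a location are hypothetical, of a table over an existing Parquet dataset:

CREATE TABLE hive.default.sales (
    id bigint,
    amount double,
    sale_date date
)
WITH (
    format = 'PARQUET',
    external_location = 's3a://my-bucket/sales/'
);

The external_location table property points the table at the existing files instead of a location managed by the Metastore.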

Explore Parquet files

Important

To graphically explore Parquet datasets you need the Denodo subscription bundle Enterprise Plus.

Features

The Denodo Embedded MPP provides the following features when working with Parquet files:

  • Graphically explore Parquet datasets, create tables in the MPP and base views in Denodo (From Object Storage tab of the Embedded MPP data source)

  • Create base views over existing Hive tables in an External Metastore (From MPP Catalogs tab of the Embedded MPP data source). To do this, create a new catalog in the Embedded MPP, because the predefined catalog named hive is restricted and cannot be accessed from the From MPP Catalogs tab.

  • Querying (see the sample query after this list)

  • Embedded MPP Acceleration

  • Bulk data load

  • Caching: full cache mode

  • Managing Remote Tables
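
As an example of the querying feature, once a table such as the hypothetical hive.default.sales above exists, it can be queried with standard SQL through the Embedded MPP:

SELECT sale_date, sum(amount) AS total_amount
FROM hive.default.sales
WHERE sale_date >= DATE '2024-01-01'
GROUP BY sale_date;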

Limitations

  • Inserts: it is not possible to insert data using bulk load into views created from Parquet files using the From Object Storage or From MPP Catalogs tabs of the Embedded MPP data source. It is possible to add single rows by executing the SQL statement INSERT, or multiple rows using direct load (see the sketch after the note below).

    Note

    • Bulk Load: a method for efficiently moving large amounts of data into a database when the source view is not in the Embedded MPP. It first generates Parquet files containing the data to be inserted, and then uploads those files to the specific path (HDFS URI) configured in the Write settings section of the Embedded MPP data source.

    • Direct Load: a method for loading data directly into the database by executing an INSERT INTO table <SELECT QUERY> statement that is delegable to the Embedded MPP data source. That is, the source and target views are both in the Embedded MPP.
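
    As a minimal sketch, reusing the hypothetical hive.default.sales table and a hypothetical sales_staging view, a single-row INSERT and a direct load look like:

     -- single row added with a plain INSERT statement
     INSERT INTO hive.default.sales
     VALUES (1, 10.5, DATE '2024-01-01');

     -- direct load: source and target are both in the Embedded MPP
     INSERT INTO hive.default.sales
     SELECT id, amount, sale_date
     FROM hive.default.sales_staging;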

Supported Operations by Metastore Type

Operation         | Embedded Hive Metastore | External Hive Metastore | AWS Glue Data Catalog
------------------|-------------------------|-------------------------|----------------------
Read              | Yes                     | Yes                     | Yes
Create/Insert (*) | Yes                     | Yes                     | Yes
Update            | No                      | No                      | No
Delete            | No                      | No                      | No

(*) To support write operations, make sure that Hive-type catalogs have the following property:

hive.non-managed-table-writes-enabled=true

Avro, CSV, ORC

To query datasets in file formats other than Parquet, such as Avro, CSV or ORC, you have to create the Hive tables in the Denodo Embedded MPP using a JDBC client, since the graphical exploration is not available for these formats in the Denodo Embedded MPP data source. Then, create the Denodo base views over those Hive tables using the From MPP Catalogs tab of the Embedded MPP data source.

The steps to follow are:

  1. Create a new catalog in the Embedded MPP because the predefined catalog named hive is not accessible from the From MPP Catalogs tab of the Embedded MPP data source.

    Add a new catalog properties file, e.g. presto/conf/catalog/hive_orc_formats.properties. Then set the connector.name property to the catalog type, hive-hadoop2, and add any other properties required by that catalog type.

    Below is an example of a Hive catalog properties file connected to the Embedded Hive Metastore to read Avro, CSV or ORC datasets:

     connector.name=hive-hadoop2
     hive.metastore.uri=thrift://hive-metastore:9083
    
     hive.config.resources=/opt/presto-server/etc/catalog/core-site.xml
    
     hive.allow-drop-table=true
     hive.non-managed-table-writes-enabled=true
    
  2. Create the Hive table in the Embedded MPP with a JDBC client. For simple SQL queries like this you can use the Presto UI, as shown below; a sketch of such a statement follows the screenshot. Note that this SQL query editor returns at most 100 records per query.

    Create Hive table from ORC datasets
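
    As a hedged sketch, with hypothetical schema, table, columns and location, such a statement for an existing ORC dataset could look like:

     CREATE TABLE hive_orc_formats.default.events (
         event_id bigint,
         event_name varchar,
         event_time timestamp
     )
     WITH (
         format = 'ORC',
         external_location = 's3a://my-bucket/events/'
     );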

  3. Create the Denodo base view on the previously created Hive table using the From MPP Catalogs tab of the Embedded MPP data source.
