Iceberg

Apache Iceberg is a high-performance table format for large analytic datasets. Iceberg tables allow schema evolution, partition evolution and table version rollback, without the need to rewrite or migrate tables.

The Denodo Embedded MPP can query data stored in Iceberg tables. To do so, it needs a Metastore that acts as the metadata catalog, which can be either the Embedded Hive Metastore or an External Metastore, and a catalog of type iceberg.

iceberg.properties
connector.name=iceberg

# Embedded Hive Metastore
hive.metastore.uri=thrift://hive-metastore:9083
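
As a quick sanity check, the catalog can be listed from any SQL client connected to the Embedded MPP. This is a minimal sketch that assumes the catalog is named iceberg, after the iceberg.properties file above:

```sql
-- The iceberg catalog should appear in this list once iceberg.properties is loaded.
SHOW CATALOGS;

-- Lists the schemas that the Metastore currently exposes through the iceberg catalog.
SHOW SCHEMAS FROM iceberg;
```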

Before creating Iceberg tables you need to create a new schema that sets the location where the tables –their Parquet files and metadata files– will be placed.

For this you can use your favorite SQL client:

Create schema for new Iceberg tables
CREATE SCHEMA iceberg.<schema_name>
WITH (location = 's3a://my_bucket/path/to/folder/');

or the Denodo stored procedure CREATE_SCHEMA_ON_SOURCE:

Create schema procedure for new Iceberg tables
CALL CREATE_SCHEMA_ON_SOURCE(
   'admin_denodo_mpp',
   'embedded_mpp',
   'iceberg',
   '<schema_name>',
   's3a://my_bucket/path/to/folder/');
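
After either option, the new schema can be checked and selected as the session default, so that unqualified table names resolve against it. A minimal sketch, assuming a client that supports the USE statement (for example, the Presto CLI) and keeping <schema_name> as the placeholder used above:

```sql
-- Confirm that the schema was created in the iceberg catalog.
SHOW SCHEMAS FROM iceberg;

-- Make it the default for this session; replace <schema_name> with the real name.
USE iceberg.<schema_name>;
```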

The CREATE TABLE statement creates an empty table even if data is already present in the S3 folder location. Creating an Iceberg table does not work like creating a Hive table: for a Hive table, existing data in the S3 bucket that is compatible with the table schema is treated as the contents of the table, whereas an Iceberg table requires additional metadata, so files that are merely present in the folder are not part of the table.

Create a New Iceberg table
CREATE TABLE orders (
   orderkey bigint,
   custkey bigint,
   orderstatus varchar,
   totalprice double,
   orderpriority varchar,
   clerk varchar,
   shippriority integer,
   comment varchar
) WITH (
   location = 's3a://my_bucket/path/to/folder/',
   format = 'PARQUET'
);
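
The following sketch illustrates that behavior. It assumes the orders table above was just created in the current schema; the inserted values are purely illustrative, and the availability of the $snapshots metadata table depends on the Presto version embedded in the MPP:

```sql
-- Right after CREATE TABLE this should return 0,
-- even if Parquet files already exist in the location.
SELECT count(*) FROM orders;

-- Rows only become part of the table through Iceberg itself, for example with INSERT.
-- The values below are illustrative sample data.
INSERT INTO orders
VALUES (1, 100, 'O', DOUBLE '172.5', '1-URGENT', 'Clerk#000000001', 0, 'sample row');

-- Each committed write adds a snapshot to the table metadata (version-dependent feature).
SELECT * FROM "orders$snapshots";
```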

Therefore, to access Parquet datasets using Iceberg tables you must create Hive tables for those Parquet datasets and then use the CREATE TABLE AS SELECT (CTAS) statement to create the new Iceberg tables from those Hive tables.
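
The first step, exposing an existing Parquet dataset as a Hive table, could look like the sketch below. It assumes a catalog of type hive named hive with a default schema, as in the CTAS examples that follow, and a hypothetical dataset folder; the columns must match the actual Parquet files:

```sql
-- External Hive table over Parquet files that already exist in S3.
-- 's3a://my_bucket/path/to/parquet_dataset/' is a hypothetical location.
CREATE TABLE hive.default.orders (
   orderkey bigint,
   custkey bigint,
   orderstatus varchar,
   totalprice double,
   orderpriority varchar,
   clerk varchar,
   shippriority integer,
   comment varchar
) WITH (
   external_location = 's3a://my_bucket/path/to/parquet_dataset/',
   format = 'PARQUET'
);
```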

Create Iceberg table with CTAS
CREATE TABLE IF NOT EXISTS iceberg.schema.ctas_orders
AS (SELECT * FROM hive.default.orders);

Create partitioned Iceberg table with CTAS
CREATE TABLE IF NOT EXISTS iceberg.schema.ctas_weblog
WITH (
        partitioning = ARRAY['elb_name', 'elb_response_code']
)
AS (SELECT * FROM hive.default.weblog);

The drawback of this method is that it will temporarily duplicate the storage of the dataset, since it will store both the data for the Hive table and the data for the new Iceberg table.
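
While both copies exist, it may be worth checking that the CTAS copied every row before deciding what to do with the source data. A minimal sketch using the table names from the first CTAS example:

```sql
-- Compare row counts between the Hive source and the new Iceberg table.
SELECT
   (SELECT count(*) FROM hive.default.orders)        AS hive_rows,
   (SELECT count(*) FROM iceberg.schema.ctas_orders) AS iceberg_rows;
```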

Because Presto does not support migrating Hive tables to Iceberg, the alternative is to use the Spark migrate procedure, which replaces the existing Hive table with an Iceberg table that uses the same data files.
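
In Spark SQL, a call to this procedure could look like the sketch below. It runs in Spark, not in the Embedded MPP, and assumes a Spark session with the Iceberg SQL extensions enabled, a session catalog named spark_catalog that points to the same Metastore, and the Hive table default.orders; adjust these names to your environment:

```sql
-- Spark SQL: replace the Hive table with an Iceberg table, reusing its data files.
-- spark_catalog and default.orders are assumptions for this example.
CALL spark_catalog.system.migrate('default.orders');
```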

There is also no support for registering existing Iceberg tables in Presto, but the register_table Spark procedure can be used instead.
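
A sketch of the registration from Spark SQL follows. The catalog name iceberg_catalog, the target table name and the metadata file path are placeholders, not values taken from this manual; the metadata file is the table's current Iceberg metadata JSON file:

```sql
-- Spark SQL: register an existing Iceberg table in the catalog from its metadata file.
-- All names and the path below are placeholders to replace with real values.
CALL iceberg_catalog.system.register_table(
   table => '<schema_name>.<table_name>',
   metadata_file => 's3a://my_bucket/path/to/folder/metadata/<metadata_file>.metadata.json'
);
```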

Once the Iceberg table is registered, you can use the embedded data source in Denodo to create a Denodo base view on top of the table using the From MPP Catalogs tab.

Explore Iceberg tables

Features

The Denodo Embedded MPP provides the following features when working with Iceberg tables:

  • Create base views over data stored in Iceberg format

  • Bulk data load

  • Caching: full cache mode

  • Remote tables

Limitations