USER MANUALS

Partition Tables

Hive tables can be partitioned, as this can improve query performance, especially when the columns being queried are partitioned columns. This means that, for each column value of the partitioned column, there will be a separate folder under the dataset location.

The Denodo Embedded MPP can create tables from partitioned Parquet datasets that follow the Hive naming convention for folders, i.e.: columnName=value, elb_name=abc.

.../weblogs/elb_name=abc/elb_response_code=200/…
.../weblogs/elb_name=abc/elb_response_code=404/…
.../weblogs/elb_name=xyz/elb_response_code=200/…
 CREATE TABLE IF NOT EXISTS weblogs (
     request_timestamp varchar,
     request_ip varchar,
     request_port integer,
     backend_ip varchar,
     backend_port integer,
     request_processing_time double,
     backend_processing_time double,
     client_response_time double,
     received_bytes bigint,
     sent_bytes bigint,
     request_verb varchar,
     url varchar,
     protocol varchar,
     elb_name varchar,
     elb_response_code varchar)
 WITH (
     format = 'parquet',
     partitioned_by = ARRAY['elb_name', 'elb_response_code'],
     external_location = 's3://my_bucket/path/to/folder/'
 );

Note

Hive tables require the partition columns to be the last columns of the CREATE TABLE statement.

However, if we query a partition table after creating it, we will see that it returns zero results. The Hive Metastore needs to discover what partitions exist by querying the underlying storage system. The Presto procedure system.sync_partition_metadata(schema_name, table_name, mode) is responsible for detecting the existence of partitions.

When inserting data into new partitions we need to invoke the sync_partition_metadata procedure again, to discover the new records.

call system.sync_partition_metadata('default', 'weblogs', 'FULL', true);

The embedded_mpp data source in Denodo calls sync_partition_metadata transparently, after each partition table creation.

Add feedback