Manage Views Created From Object Storage¶
Views created from files in an Object Storage are different from other views in some aspects:
Inserts: It is possible to insert individual rows or insert using a SELECT query that can be fully delegated to the target data source (direct load). However, it is not currently possible to insert data into these kind of views using Denodo’s bulk load mechanism.
Export: The VQL of the view includes an id to identify the route defined on the data source for the external object storage. It also includes the relative path from that base route.
Lineage: If the base route of the data changes or the schema of the files (columns and types) differs from the schema in the view, the view will turn into an invalid state. In order for the view to be valid again you can recreate it from the data source with the new schema changes or fix the route to point to the right dataset.
Schema evolution: Schema evolution can be considered during the base view creation. As any other base view, it is also possible to periodically update the tables metadata executing a source refresh.
Gather statistics: In order to achieve the best performance in queries accessing these data, it is very important to compute the statistics on these views and keep them up to date. You can do that by executing the stored procedure COMPUTE_SOURCE_TABLE_STATS or following the steps in section Gathering the Statistics of Views
Metadata maintenance: There exists some maintenance tasks that is recommended to execute on a regular basis to make sure the data is complete and the performance is optimal:
Refresh partitions of Parquet tables: This is necessary to detect data added in new partitions. Use the procedure REFRESH_SOURCE_TABLE_METADATA for a single view or REFRESH_EMBEDDED_MPP_TABLES_METADATA for multiple views in bulk.
Clean metadata of Iceberg tables: Use stored procedures REMOVE_ICEBERG_VIEW_ORPHAN_FILES and REMOVE_ICEBERG_VIEW_SNAPSHOTS to remove unnecessary metadata. These operations are important as keeping unnecessary data can have an impact in performance and cost.
Note
Consider creating jobs using the Denodo Scheduler to compute the statistics and execute the metadata maintenance tasks periodically.
Source Refresh on Views From Object Storage¶
The Denodo Lakehouse Accelerator allows accessing data in an Object Storage using its embedded Massive Parallel Processing engine. For these base views created using the Lakehouse Accelerator data source, the Source Refresh has additional options to deal with the Schema Evolution. A new screen will be prompted before executing the source refresh, where you can tune up the introspection process:
This panel is analog to the one presented in section Create base views from data in an object storage considering schema evolution. It allows to configure the data sample Virtual DataPort will use to infer the view schema:
Maximum number of files to analyze: establishes the maximum number of files (file limit) to analyze when refreshing the table.
Consider files modified after this date (optional): in this field you can specify the time instant (in UTC time format) from which the files will be analyzed. The files will be analyzed from oldest to newest, starting on the specified instant until the file limit is reached. If this field is empty, Virtual DataPort will analyze the files from newest to oldest until the file limit is reached.
