USER MANUALS

Excel Sources

The Excel data sources extract data from Excel files using the Apache POI library (Apache POI API - the Java API for Microsoft Documents).

To create a new Excel data source, click Excel on the File > New > Data source menu. The Tool will display the dialog to create the data source.

Creating an Excel data source

Creating an Excel data source

The following data are requested in this dialog:

  • Name. Name of the data source.

  • Type of file. Depending on the version of the file, the option “Stream tuples” will be enabled or not (see below).

  • File location. Path to the Excel file. This path can be parameterized according to the query made, using interpolation variables (see section Paths and Other Values with Interpolation Variables). The section Path Types in Virtual DataPort describes the formats of the available paths and how to use parameters in them.

  • Worksheet (optional). Name of the worksheet that you want to extract data from.

    If left empty, the base view will retrieve the data from the first sheet of the file.

    You can also enter a comma-separated list of sheets. E.g. “Sheet1, Sheet2”.

    To extract data from all the worksheets of an Excel file, instead of just from one sheet, select Extract data from all sheets. In this case, the fields of the base view are determined by the columns of the first worksheet of the file. Therefore, you have to take into account the following:

    • If the following sheets have more columns, they will be ignored.

      In the rows with data obtained from a worksheet that does not have a column that is present in the first worksheet, the values for this column will be NULL.

      For example, let us say we have an Excel file with two worksheets. The first sheet (Sheet1) has the columns A, B and C; and the second sheet (Sheet2) has the columns A, C and D.

      The base view will have will have the same columns as the first sheet of the file: A, B and C. When you query this view and Virtual DataPort extracts data from the second sheet of the file, it will do the following:

      • It will return NULL for the values of the field B because the second sheet does not have it.

      • The values of the column D will be ignored.

    • These rules also apply when you enter more than one worksheet in the “Worksheets” box. In this case, the fields of the base view are determined by the columns of the first worksheet of this list.

  • Select Extract sheet name as a new column if you want the base view to have an extra column that will contain the name of the worksheet from which each row is obtained from. The default name of this extra column is “sheet_name”.

    Knowing from which worksheet a row comes from is useful when the “Extract data from all sheets” option is selected.

  • StartCell and EndCell (optional): first and last cell of the sheet that will be processed.

    If “StartCell” is B2 and “EndCell” is Z99, the view will return the data in the range B2:Z99.

    If “StartCell” is B2 and “EndCell” is empty, the view will retrieve all the data from the cell B2 to the end of the file.

    If “StartCell” is empty and “EndCell” is Z99, the view will retrieve all the data in the range A1:Z99.

    If both parameters are empty, the view will retrieve all the data in the sheet.

    If you select the option “Extract data from all sheets”, the base view of this data source will return the data of this range of cells, of each worksheet of the file.

  • Has headers: select this check box if the first row of the retrieved data contains the name of the columns and not data.

  • Stream tuples: if selected, the file will be read using a SAX parser (only available for Excel 2007 files (*.xslx)). If cleared, it will be read using a DOM parser.

    The benefit of the SAX parser over the DOM parser is that it has a very low memory footprint, which is important when working with big Excel files.

    Enabling or not “Stream tuples” affects the data types of the base view of this data source:

    • If “Stream tuples” is cleared, the Server obtains the type of each field from the file’s metadata.

    • If “Stream tuples” is selected, the Server infers the type of each column from the first rows of the file.

      The number of rows used to infer the types is controlled by the property com.denodo.vdb.contrib.wrapper.xls.event.content.introspection.maxrows.

      To change it, execute this command:

SET 'com.denodo.vdb.contrib.wrapper.xls.event.content.introspection.maxrows'
= '<number of rows>';

When you select “Stream tuples”, the columns with date values will be always inferred as “text” fields. If you need to obtain datetime values (localdate, timestamp…), you have two options:

  1. Clear the “Stream tuples” check box. By doing this, the base view will infer columns with date values as fields of type “date”. The problem is that you lose the advantages of using the SAX parser (i.e. performance).

    Do not to this for large Excel files, only clear this if you know the Excel file only has a few thousand rows or less.

  2. Or, create a Selection view over this base view and convert the text fields to a datetime fields using the functions TO_LOCALDATE, TO_TIME, TO_TIMESTAMP or TO_TIMESTAMPTZ.

In the Metadata tab, you can set the folder where the data source will be stored and provide a description. When editing the data source, you can also change its owner by clicking the button image1.

Click Save to create the data source. Then, click Create base view to display the schema that the new base view will have. Click Save to create the base view.

As opposed to other data sources, you can only create one base view over an Excel data source. If you need to create another base view to obtain data from the same file, create another Excel data source.

In Design Studio, the Excel data sources are considered a type of data source. However, internally, they are Custom data sources. This has several minor implications such as:

  • The VQL command to list the Excel data sources is LIST DATASOURCES CUSTOM, which returns a list of all the Custom data sources, not just the Excel ones.

  • When you export an Excel data source (right-click on the data source > Export), the selected type of source is “Custom”.

Size Limit When Processing Excel Files

To prevent memory issues, there is a limit on the size of the Excel worksheet that a data source will process. If the worksheet - not the file - is bigger, the query will fail with this error:

The file is not accessible or is not an Excel file. Tried to allocate an array of length xx, but the maximum length for this record type is 100,000,000.
If the file is not corrupt or large, please open an issue on bugzilla to request increasing the maximum allowable size for this record type.
As a temporary workaround, consider setting a higher override value with IOUtils.setByteArrayMaxOverride()

Note that Excel files are compressed when stored in disk; this limit is about the size of the worksheet after uncompressing it.

This limit is 150 megabytes by default. To modify it, execute this command:

SET 'com.denodo.vdb.contrib.wrapper.xls.event.byteArrayMax' = '<maximum size of sheet in bytes>';

For example, to increase this limit to 200 megabytes, execute this:

SET 'com.denodo.vdb.contrib.wrapper.xls.event.byteArrayMax' = '209715200';

Reading an Excel File with Multiple File Entries

If during the processing of an Excel worksheet from a data source the following message is displayed:

Received exception with message 'The file is not accessible or is not an Excel file. The file appears to be potentially malicious. This file embeds more internal file entries than expected.
This may indicates that the file could pose a security risk.
You can adjust this limit via ZipSecureFile.setMaxFileCount() if you need to work with files which are very large.
Limits: MAX_FILE_COUNT: 1000

The displayed message is due to a security measure designed to prevent security breaches and the exploitation of vulnerabilities. This is particularly beneficial when working with complex or large files that, although correct, exceed the default security threshold.

Adjusting this limit ensures that such files are accessible and processable, removing unnecessary restrictions that might prevent their proper opening and reading. The default value for this limit is 1000.

The following property allows changing this limit:

SET 'com.denodo.vdb.engine.wrapper.raw.excel.maxFileCount' = '<maximum file entries>';

For example, to increase this limit to 2000 file entries per Excel file, execute this:

SET 'com.denodo.vdb.engine.wrapper.raw.excel.maxFileCount' = '2000';

To remove property and reseting it to its default value you should execute the following sentence:

SET 'com.denodo.vdb.engine.wrapper.raw.excel.maxFileCount' = NULL;

Use Temporary Files to Load Excel Files

Note that the com.denodo.vdb.contrib.wrapper.xls.event.byteArrayMax property is used to determine which Excel documents will be processed based on the specified size of the sheet.

Excel files are loaded using the Apache POI library. By default, Apache POI loads the Excel file into memory, and if the file is too large, a Java heap space error can be thrown, stopping the processing. The error that could be found in the log files would be:

Unexpected error: Java heap space
java.lang.OutOfMemoryError: Java heap space
com.denodo.vdb.engine.AbstractResultMovement [] - com.denodo.vdb.cache.VDBCacheException: Execution interrupted.
com.denodo.vdb.cache.VDBCacheException: Execution interrupted.

To enable the load of Excel files using temporary files you should execute:

SET 'com.denodo.vdb.contrib.wrapper.xls.event.thresholdBytesForTempFiles' = '<size to start using temporary files in bytes>';

For example, to store sheets larger than 100MB, execute this:

SET 'com.denodo.vdb.contrib.wrapper.xls.event.thresholdBytesForTempFiles'='100000000';

This setting can be deactivated by using -1.

SET 'com.denodo.vdb.contrib.wrapper.xls.event.thresholdBytesForTempFiles'='-1';

The default directory for these temporary files is usually /tmp on Linux and %APPDATA%\Local\Temp on Windows, but it can be redefined using the JVM setting -Djava.io.tmpdir=<path>.

When a worksheet exceeds the specified memory limit, its contents are stored in a temporary file within the aforementioned directory. For example: %APPDATA%\Local\Temp\poifiles\poi-zip-entry12328023392292087152.tmp

These temporary files are marked for deletion on JVM exit and are explicitly deleted when the query ends, so the temporary directory should be empty unless Excel queries are currently being executed. Only if the JVM process is externally terminated might files not be properly deleted. There is no setting to limit the number or size of files created.

Fields Detected as Text During Introspection with Another Format

Virtual DataPort detects whether a field detected as text in introspection has date or numeric format and displays the value of a field based on the detected format.

This behavior can be disabled. In case of dates you have to execute the following:

SET 'com.denodo.vdb.contrib.wrapper.ExcelWrapper.skipEmbeddedFormatForDates'='false';

And in case of numbers, execute this command:

SET 'com.denodo.vdb.contrib.wrapper.ExcelWrapper.skipEmbeddedFormatForNumbers'='false';

If any of the above commands are executed, the value will be displayed as text for the format that has been disabled.

Add feedback