Denodo FileSystem CustomWrapper - User Manual
You can translate the document:
Introduction
The filesystem-customwrapper is a Virtual DataPort custom wrapper that enables VDP to retrieve information from a filesystem and display it in a relational way. This allows you to inspect local and network folders and also folders which are accessible by a FTP server, retrieve lists of files (in a single folder or recursively) and filter files using any part of its metadata (file name, file size, last modification date, etc.). The custom wrapper can also read their contents --in text or binary form-- and allows creating, updating and deleting text files (except when using FTP).
The custom wrapper allows you to retrieve the meta-information of the files such as filename, size or date modified. Three custom wrapper classes are provided in the JAR file:
- com.denodo.connect.filesystem.ReadFileSystemConnector: provides access to the metadata of the files.
- com.denodo.connect.filesystem.ReadWriteFileSystemConnector: in addition to the metadata of the files, allows you to read the file contents and to create, write and delete files.
- com.denodo.connect.filesystem.ReadFileSystemFTPConnector: provides access to the metadata of the files through FTP, FTPS or SFTP connections.
One of the possible uses of this wrapper is selecting a set of files in a folder (or folders, recursively), then filter them by date, size and/or file name and use that list of files in the left branch of a nested join operation to pass them to an XML or Delimited File base view that uses a parameterized file name. This way we can extend the native capabilities of the XML and Delimited File data sources; they allow to only read one file or all the files in a folder, but by using a nested join with the file system custom wrapper we can select a subset of all the files, and/or files recursively in subfolders.
The entry point for the queries to the wrapper is the folder where the search for files will be performed, specified through the parameter parentfolder. The VDP customwrapper will list the files are located in that directory; it is possible to do this in a recursive way, retrieving also the contents of the subfolders (also recursively). The parameter recursive is mandatory also as parentfolder. You always should specify this field: false in the case that you only want search the files in the specified directories and true if you want to do a search also in subdirectories recursively.
Features
Capabilities
This wrapper is able to carry out read, create, update and delete operations. It can be delegated the following query artifacts and operators from VDP:
- Operators: =, <>, <=, =>, <, >, REGEXP_LIKE, LIKE, IS_TRUE, IS_FALSE.
- AND operations.
- NOT operations.
- Append. This parameter specifies if we want to replace the content of the file or simply append it to the end when performing insert operations.
- Overwrite. When executing insert queries over the wrapper this parameter specifies the behavior when trying to create files that already exist; if true, the file will be overwritten, if false the query will result in an error.
- Recursive. When false, the wrapper will list all the files that are direct children of the folder parentfolder. When true, it will recursively search in all the subfolders of parentfolder and will return all the files that have parentfolder as an ancestor.
- The custom wrapper allows access, not only to files locally living at the VDP server’s file system, but also to any remote files accessible through a network-mounted filesystem.
Limitations
- Creating and updating binary files is not allowed. This is due to a limitation in VDP, which does not support the insertion of values with blob data type.
- The ReadWriteFileSystemConnector wrapper (that reads the contents of files) could throw a java heap space error if a query retrieves a big amount of files and several of these files have a large size. There is an input parameter for specifying the max file size that can be read to avoid these memory problems; if the size of a file is larger than the fileMaxSize setting, the contents of the file will not be read, just the rest of the fields. We can set the chunk size parameter in the Admin Tool Preferences-> Connection to avoid the java heap space error too, to reduce the number of rows that are sent to the driver in each block, thus we are reducing the size of used memory. If you go to read large files, you could set this parameter to about 10.
- If you use ReadFileSystemFTPConnector and the protocol SFTP the custom wrapper does not return the values of the parameters canwrite and canexecute.
Usage
Installing the custom wrapper
In order to use the denodofile-customwrapper first it is necessary to install its JAR file into the Denodo VDP server. For this, go to the VDP Administration tool, open the “File > Extensions” menu option and install the jar-with-dependencies file that is included in the distribution of the FileSystem custom wrapper.
Creating a new datasource
After this, create a new data source using “File > New > DataSource > Custom” menu option and fill the appropriate parameters:
- Name: type a name for the data source.
- Class name: choose one of the three available classes for the wrapper, depending on the capabilities that you need for your use case:
- com.denodo.connect.filesystem.ReadFileSystemConnector, to read metainformation of files.
- com.denodo.connect.filesystem.ReadWriteFileSystemConnector, to have access to the contents of files and/or create, update or delete files.
- com.denodo.connect.filesystem.ReadFileSystemFTPConnector, to read metainformation of files via FTP.
- Select Jars: check this checkbox and select in the list below it the JAR file that was installed in the first step.
- Click OK.
Creating Base Views
Once the datasource has been created, we can create base views on it. If the ReadFileSystemConnector has been chosen, this custom wrapper does not need input parameters to create a base view. So after accessing view to create base view we only can set the primary key or change the datatype of the parameters.
If ReadWriteFileSystemConnector has been chosen, you can fill the optional field fileMaxSize that represents, in bytes, the maximum size of the files whose contents will be read. If you do not fill this field, no limit will be established. The configuration of this parameter can avoid memory problems.
Note: Besides setting fileMaxSize, it is very important to configure the chunksize parameter in Admin Tool Preferences -> Connection; this way the amount of used memory can be reduced thanks to the specification of a limit in the number of rows that are being sent to the client in each block, thus we are reducing the size of used memory. To avoid memory problems when it is necessary to access large files or a big number of files, it is recommended to reduce this parameter to a more suitable value.
If ReadFileSystemFTPConnector has been chosen. Click to refresh the input parameters of the datasource and fill in the fields Host, Port, and User. All of them are necessary to establish the FTP/FTPS/SFTP connection. Besides, you can fill the field Password which may be mandatory for the specified user in order to access his/her home directory on the remote machine. There are also two check boxes: Establish the connection using the active mode that allows changing the mode of the FTP connection between the client and the server (passive is default and preferred), and Use deprecated FTP LIST commands instead of MLSD which enables compatibility with old FTP implementations like Microsoft’s WinNT FTP (IIS FTP) that do not implement the FTP MLSD command. You also should choose the Protocol among FTP, FTPS and SFTP. And if you use SFTP you can add the Key File and Passphrase associated with the key file.
The schema of the three custom wrappers contains these columns:
- parentfolder: The path of the parent directory, as a string. The custom wrapper will list all the files located in this directory
- This parameter is mandatory in a SELECT operation
- relativepath: The relative path is used when we execute a recursive query, and it expresses the situation of the file with respect to the parentfolder.
- filename: the name of the file or the folder, including the extension for files.
- extension: the extension of the file. It will be null if the file is a directory.
- fullpath: the full path of the file, calculated as CONCAT(parentfolder, relativepath, separator, filename), where separator is the path element separator native to the operating system.
- filetype: either ‘file’ or ‘directory’.
- hidden: true if it is a hidden file, false otherwise.
- datemodified: the timestamp when the file was last modified.
- canread: true if VDP can read the file.
- canwrite: true if VDP can modify the file.
- canexecute: true if VDP can execute the file.
- size: size of the file in bytes. It will be null for folders.
- recursive: if false the search for files will be limited to the files that are direct children of the parentfolder. If true, the search will be done recursively, including subfolders of parentfolder so any file that has parentfolder as ancestor will be included in the results. This parameter is mandatory in a SELECT operation
In Addition, in the ReadWriteFileSystemConnector there are also these columns:
- binarycontents: binary contents of the file.
- textcontents: text contents of the file, if it is a text file.
- append: when updating a file through an update query, if the append column is set to true the value specified in the textcontents column will be append to the end of the file; if set to false it will instead replace the current contents of the file.
- overwrite: when creating files through an insert query, if the overwrite column is set to false and the specified file already exists an error will be thrown; if set to true it will instead replace the existing file with the new contents.
- readencoding: the encoding to use text files, if you do not specify any encoding, it will be used platform default encoding.
In the following image we can see the view schema:
Examples of use
Select Operation
This is an example of a select operation that retrieves all the files in a folder. In this type of query the parameter parentfolder is mandatory and represents the starting point of the search.
In this example we search recursively for all the files in c:/temp2 folder:
SELECT *
FROM filesystem
WHERE (parentfolder = 'c:/temp2' AND recursive is true)
In the following image we can see the results of this query with the ReadFileSystemConnector custom wrapper (that only retrieves the file metadata):
In the following image we can see the results of this query with the ReadWriteFileSystemConnector custom wrapper (that retrieves both the metadata and the contents of the files):
If you use the ReadFileSystemFTPConnector custom wrapper the parameter parentfolder is required too and a slash (‘/’) or an empty string (‘’) represent the home directory.
In this example we search recursively for all the files in the home directory:
SELECT *
FROM ftpfilesystem
WHERE (parentfolder = '' and recursive is true)
If you use the custom wrapper ReadFileSystemFTPConnector custom wrapper (that only retrieves the file metadata), you have to fill in the parameters Host, Port, User, Password, Active Mode, LIST/MLSD and Protocol. If you choose SFTP you can fill in Key File and Passphrase.
Filtering files
This is an example with a more complex condition, where we search for files located in 'c:/temp2' recursively, with a size smaller than 600 and last modification date after '11-nov-2013 13h 46m 2s'
SELECT * FROM filesystem
WHERE parentfolder = 'C:\\Temp1'
and recursive = true
and size < 3000
and datemodified < 'Dec 5, 2014 9:48:49 AM'
CONTEXT ('i18n'='us_pst', 'cache_wait_for_load'='true')
UPDATE Operation
This is an example of an UPDATE query. This type of query is used for modifying the contents of a single file. It is mandatory to specify values for either the column fullpath or both the parentfolder + filename columns. It is only possible to update files one by one, and only using text contents. If the directories of the path do not exist, they will be created. And if the file does not exist, it will be created too, in this case UPDATE operation would be equivalent to the INSERT operation.
In this example we set 'This text file has been modified file' as the contents of the file 'c:/textfile.txt':
UPDATE filesystem_write
SET (textcontents, append) =
('This text file has been modified file',true)
WHERE parentfolder='c:/temp1' AND filename='textfile.txt'
INSERT Operation
This is an example of INSERT operation, used to create new files in the file system. In this type of query, it is mandatory to specify values for either the column fullpath or both the parentfolder + filename columns. It is only possible to insert files one by one, and only using text contents. If the directory where we want to insert the new file and its ancestors, they do not exist, they will be created.
In this example we create a new file in the folder ‘c:/’ with name ‘proba20.txt’, setting the contents to ‘If there is a file with this path, it will be overwritten’ and overwriting any previous file with that same name:
INSERT
INTO filesystem_write
(parentfolder, filename, textcontents, overwrite)
VALUES
('c:/temp1', 'textfile.txt', 'This is a text file', true)
DELETE Operation
This is an example of DELETE operation, used to delete files. In this type of query it is mandatory to specify values for either the column fullpath or both the parentfolder + filename columns. It is only possible to delete files one by one.
In this example we delete the file ‘c:/temp1/testfile.txt’:
DELETE
FROM filesystem_write
WHERE parentfolder = 'c:/temp1' AND filename = 'testfile.txt'
Passing the results of the filesystem custom wrapper to a delimited file data source
First, we have to create a delimited file data source by clicking on File -> New -> Data Source -> Delimited File. Then we need to configure the Data Route parameter: for this click on Configure. We introduce a variable in Local path, using the reserved character ‘@’, as in this case “@path” .
Click on Ok and after we have to choose a path from one of the files that we want to access.
Now, we have to set the Column Delimiter, and the other parameters, in this case we use the comma (‘,’)
Click on OK and finally, we create a base view and after we can configure the name of the columns, and its types.
After creating the Delimited File, we create a Join nested File -> New -> Join. First we add the base view of the file system custom wrapper, in this case we add, one of the first examples of this manual. Second we add the delimited file base view. After we link the fullpath parameter of the first view with the path parameter of the second view. We configure the join strategy in the Join Conditions tab changing the Join Method to Nested Join.
In the Where Conditions tab, we have to set the parentfolder because is a mandatory field, and select with the other conditions the files, that we want to access, in this example we select the files that their filename start with ‘test’
In the following image we can see the results of the execution of the Nested Join, the file system custom wrapper returns 1 rows, and on each delimited file, there are two rows too, so the VDP has received 2 rows.