Delimited File Sources

Use the delimited file data source (DF) to extract data from plain text files like these:

  • Comma-separated values files (CSV files). With this data source, the delimiter can by a comma or any other character.

  • Files that have a more complex structure that can be represented with a regular expression (e.g. application log files).

  • Fixed-width files. That is, files in which the columns do not have delimiters and instead, each column has a fixed width.

To create a new DF (delimited file) data source, right-click on the Server Explorer and click New > Data source > DF

The Tool will display the dialog to create the data source.

Importing from a new delimited file source

Importing from a new delimited file source

The following data are requested:

  • Name. Name of the new data source.

  • Data route. Path to the delimited file. The formats of the available paths are described in detail in the section Path Types in Virtual DataPort. The path can be parameterized using interpolation variables (see section Paths and Other Values with Interpolation Variables).

    If the selected “Data route” is Local or FTP / SFTP / FTPS Client and the route points to a directory, the base views created over this data source will retrieve the data from all the files in the directory and not just one file. The data in all the files must have the same schema.

    When the route points to a directory but you only want to process some of the files in this directory, enter a regular expression that matches the names of these files, in the File name pattern box. For example, if you want the base view created over a data source to return the data of all the files with the extension log located in the folder C:\log_files, set the “Local path” to C:/log_files and the “File name pattern” to (.*)\.log. Note that “File name pattern” is a regular expression and not a file pattern. That is why “.” is prefixed with “” (in a regular expression, the dot matches any character unless is prefixed by a slash).

    Note

    To retrieve data from files that use a date-based naming convention (as is usual in log files), use the ^DateRange interpolation function. See the section Paths Using Date Ranges to learn to use this function.

  • Ignore route errors. If selected, the data source will ignore the errors occurred when accessing the file(s) to which the data source points.

    This main goal of this option is to ignore files that cannot be found, when the data source points to a collection of files and you know some of them may be missing. For example, you can create a DF data source to read a set of log files with this local path:

    /opt/denodo/denodo/logs/apache-tomcat/http_access.^DateRange("yyyy/MM/dd",@start_date,@end_date,"yyyy-MM-dd").log
    

    See more about “DateRange” in the section Paths Using Date Ranges.

    When you query a base view created over this data source, the data source will read all the log files in order. For example, if in the query you put the condition start_date='2018/05/01' AND end_date = '2018/05/04', the data source will try to read the files “http_access_2018-05-01.log”, “http_access_2018-05-02.log”, “http_access_2018-05-03.log” and “http_access_2018-05-04.log”. If one these files is missing, the query will fail.

    If you want to ignore this error, select the check box Ignore route errors. With this option if one of the files does not exist, the data source will skip it and read the next one. If you run the query from the administration tool, you can identify which files could not be read in the Execution trace: in the trace, click on the nodes with Type = Route. The ones that could not be read will have the attribute Exception followed by an error message.

CSV Files

To process a comma-separated values file (a CSV file), select Use column delimiter and in the box Column delimiter, enter the character that separates the values. Note that the delimiter can be a comma or another character. If you enter two or more characters, the behavior depends on the check box Delimiter consists of multiple characters:

  • If it is selected and the delimiter has two or more characters, the values have to be separated by these characters. For example, if you enter -|-, the file has to have this structure:

    value 1-|-value 2-|-value 3-|-value4
    
  • If it is cleared and the delimiter has two or more characters, all these characters will be considered a delimiter. For example, if you enter ,|, every time the data source finds the character comma (,) or the vertical bar (|), it will consider the beginning of a new field.

You have to enter the “invisible” characters in a special way. See the table below:

Invisible characters as columns delimiter

Character

Meaning

\t

Tab

\n

Newline

\r

Carriage return

\f

Formfeed

Note

The text qualifier is the double quote (“).

Files With Complex Structure (Regular Expression)

To process a file with a complex structure that can be represented with a regular expression, select Use tuple pattern and enter a regular expression in Tuple pattern. This is useful to process log files because usually, they have a fixed structure.

This regular expression specifies the format of one row of data within the file. This expression has to match the whole line of the file (or multiple lines), not only the part that you want to capture. The fields of the views created over this data source will be the capturing groups of the regular expression. For example:

(\d\d\d\d)-(\d\d)-(\d\d) - (.*)

The syntax of these regular expressions is the one defined by the Java language (the documentation of the Java class Pattern lists the constructs of these expressions).

The section Examples of How to Define a Tuple Pattern below has several examples of data sources with Tuple Pattern.

Fixed-Width File

To process a file in which the values have a fixed-width, select Use fixed length and provide the following values:

  • Column widths (bytes). Comma-separated list of integers. Each integer is the column size in bytes. E.g. if the file has seven columns, you have to enter 7 integers separated by a comma.

    Note this is the width in bytes, not characters. Take this into account when processing files that have multi-byte characters. For example, “例” only has one character but it is made of two bytes.

  • Pad character. In fixed-width files, a value is padded - usually with spaces - if it does not use all the bytes allotted to it (there is extra spaces). If in this file, the values are padded with a character that is not space, select Use custom value and enter this character. If you enter more two or more characters, all of them will be considered pad characters.

  • Replace character. The character that the data source returns for values that cannot be represented in the specified charset. For most files, you can leave Use default value (whitespace) selected. If you enter a custom value, enter a single character.

    This option rarely applies except in situations where the file has an unexpected encoding. Let us say the data source has column width = “1,1” (two columns of one byte each), you selected the encoding “UTF-8”, and when the data source reads the file the first character of a value is multi-byte. In this case, the first byte of a multi-byte character may not correspond to any character and if this happens, the data source will replace this with a space or the character you entered.

  • Alignment. The alignment of the values (left or right) when the values have padding.


  • End of line delimiter. Character string to be used to mark the end of a tuple. The default value is \n (carriage return).

    Note

    To indicate the end of line, use \n nor \r nor \r\`n, regardless of the operation system in which Virtual DataPort runs.

Note

For fixed-width files, you can leave this parameter empty. If there are no end of lines in this file, leave this box empty.

  • Start of data zone delimiter. Java regular expression identifying the position in the file where the data source has to start retrieving data (or obtaining the header if the Header option is selected). If empty, the search will start at the beginning of the file.

  • Start delimiter from variable. If selected, the “Start of data zone delimiter” will be considered the name of an interpolation variable that, at runtime, will contain the “Data zone delimiter”.

  • Include start delimiter as data. If selected, the text matching the “Start of data zone delimiter” expression will be included in the search space.

  • End of data zone delimiter. The data source will stop retrieving data from the file when it finds this string. If empty, the data source will continue retrieving data until the end of the file.

  • Include end delimiter as data. If selected, the text matching the “End of data zone delimiter” expression will be considered in the results.

  • Header. If selected, the data source considers that the first line of the data region contains the names of the fields in this file. These names will be the fields’ names of the base views created from this data source.

  • Header pattern. Java regular expression used to extract the name of the fields that form the header. This only needs to be specified if the header has a different structure than the data. This option can only be used when the “Header” option is selected.

  • Ignore matching errors. If selected, the data source will ignore the lines of this data file that do not have the expected structure. I.e. rows that do not have the expected number of columns or, if you are providing a tuple pattern, rows that do not match the pattern.

    If you clear this check box, the data source will return an error if there is a row that does not have the expected structure. When you select this check box, you can check if the data source has ignored any row in a query. To do this, execute the query from the Administration Tool. Then, click “View execution trace” and click the “Route” node. You will see the attribute “Number of invalid tuples”.

In the Metadata tab, you can set the folder where the data source will be stored and provide a description.

When editing the data source, you can also change its owner by clicking the button image1.

Click Save to create the data source.

Then, click Create base view to create a base view associated with the new data source. If the path to the data file includes interpolation variables, you will have to provide a value for them (the section Paths and Other Values with Interpolation Variables explains how to create paths to files with variables).

The Tool will display the schema that the base view will have. At this point, you can change the name of the view and the name and type of its attributes. In the Metadata tab, click Browse to select the folder where the new base view will be stored. Then, click Save (image2).

In the Server Explorer, double-click the new base view to display its schema. Click Edit to open the edition wizard of the view. In this wizard, you can change the name and type of the base view.

Examples of How to Define a Tuple Pattern

This section contains two examples of delimited-file data sources that are defined with a Tuple Pattern. Although in these examples we use a static value, the value of Tuple Pattern can be an interpolation variable (see the section Paths and Other Values with Interpolation Variables for more information about interpolation variables).

Example 1 of tuple pattern: Let us say that we have a file that contains product information in the following format (note that the discount attribute is optional):

Example 1 of tuple pattern: contents of the delimited file
product_name=Acme Laptop Computer;price=1500 euro;discount=50
product_name=Acme Desktop Computer;price=1000 dollar

The following pattern can be used to extract from each row the following information about each product:

  • Its name

  • Price and currency

  • Discount. For the tuples without a discount value, the value of this cell will be NULL.

Example 1 of tuple pattern: regular expression to extract the contents of the file
Product_name=(.+);price=([0-9]+)\s(.+);(?:discount=(.+))?

Example 2 of tuple pattern: Let us say that we want to extract the name of the files (not directories), its date and its size, from the output of the Windows command DIR:

Example 2 of tuple pattern: output of the Windows command DIR
11/07/2007 10:10 <DIR>           .dbvis
09/18/2008 15:09 <DIR>          .eclipse
01/19/2011 16:55 <DIR>          .gimp-2.6
11/10/2009 18:43         215   .ITLMRegistry
03/26/2010 14:16       3.498   .keystore
05/18/2010 17:56 <DIR>          .m2
02/02/2010 15:23 <DIR>          .maven
03/26/2010 14:01 <DIR>          .netbeans
02/02/2011 19:20 <DIR>          .smc_cache
06/15/2010 09:59 <DIR>          .ssh
10/14/2009 13:26 <DIR>          .thumbnails
02/15/2010 12:06           0   .Xauthority
01/16/2008 12:02         517    ant.install.log
02/11/2010 13:29 <DIR>          Application Data
07/16/2010 08:51         772    build.properties
02/18/2008 15:19 <DIR>          Contacts
01/14/2011 10:02         190    default-soapui-workspace.xml
02/07/2011 11:44 <DIR>          Desktop
04/01/2009 15:11 <DIR>          Favourites
08/22/2008 12:50 <DIR>          Start Menu
01/27/2011 17:18 <DIR>          My Documents
02/12/2009 12:36         201   osdadmin.ini
01/14/2011 10:02         7.958   soapui-settings.xml
02/09/2010 10:02        22.358   temp.txt
02/10/2011 09:22 <DIR>          Tracing
03/05/2010 09:41             0   vdpws.log
04/17/2009 09:49 <DIR>          workspace

The “Tuple pattern” has to be:

Example 2 of tuple pattern: regular expression to extract the output of the command DIR
(\d{2}?/\d{2}?/\d{4}?)\s+?\d{2}:\d{2}.*?(\d[^ ]*)\s(.*)

The base views created with this tuple pattern will have three fields: the date of the file, its size and its name.

Paths Using Date Ranges

When you need access to files that use a date-based naming convention (as is typical in log files), use the ^DateRange interpolation function to consider only the files between a given start date and a given end date.

The syntax of the ^DateRange function is the following:

^DateRange (
    pattern of the date range : text
  , start date : text
   , end date : text
   , pattern of the files : text
   )
  • Pattern of the date range: pattern of the parameters start date and end date. This pattern follows the syntax specified by the class SimpleDateFormat of the Java API.

    For example, yyyy-MM-dd is the pattern for <year (4 digits)>-<month in year (2 digits)>-<day in month (2 digits)>. E.g., “2014-01-31”.

    This parameter can be a literal or an interpolation variable.

  • Start date: initial date of the date range. This value has to follow the pattern specified in the first parameter of the function. It can be a literal or an interpolated variable.

  • End date: finish date of the date range. This value has to follow the pattern specified in the first parameter of the function. It can be a literal or an interpolated variable.

  • Pattern of the file names: pattern of the file names. This parameter can be a literal or an interpolated variable.

Important

When using this function, follow these two rules:

  1. The literal parameters of the DateRange function have to be surrounded by double quotes. With this function you cannot use single quotes.

  2. You cannot leave any space between the parameters of the DateRange function.

Example 1

Let us say that the C:/logs directory contains the log files generated daily by an application. The name of these files follows the pattern application_log_<year>-<month>-<date>.log (e.g. application_log_2014-11-19.log).

To create a data source that reads all the logs of January 2014, set the path of the data source to the following:

C:/logs/application_log_^DateRange("yyyy/MM/dd","2014/01/01","2014/01/31","yyyy-MM-dd").log

Note that the value of the first parameter (yyyy/MM/dd) is the pattern followed by the start date and end date parameters of the function.

A base view created over this data source will return the data “in order”. That is, the data source reads the file of the data range, which is the first of January (application_log_2014-01-01.log). Then, reads the file of the next day (application_log_2014-01-02.log), then the file of the next day (application_log_2014-01-03.log), etc.

At runtime, if one of the files of the date range is missing and “Ignore route errors” check box is not selected, the query will return an error but it will also return the contents of all the files it found. For example, if the file application_log_2014-01-15.log does not exist, a query to this data source will return an error message that explains the situation, but the result will contain the data read from all the other files. I.e. application_log_2014-01-16.log

Example 2

In the previous example, the date range of the files was fixed and it could not be changed at runtime. If you want this range to be dynamic, you can set the start date and end date parameters to be interpolation variables. E.g.

C:/logs/application_log_^DateRange("MM/dd/yyyy",@start_date,@end_date,"yyyy-MM-dd").log

The base views created over this data source will have two extra fields called start_date and end_date whose value will set the date range. For instance, the following query

SELECT *
FROM bv_application_log
WHERE start_date = '01/01/2014' AND end_date = '03/31/2014'

will make the data source to process the log files of the first quarter of 2014.

Note that when a parameter is an interpolation variable, you do not have to add double quotes.

Example 3

^DateRange can also be used with paths that point to directories instead of a group of files.

For instance, if the logs of each day are stored in a separate directory with the naming convention yyyyMMdd, set the path of the data source to this:

C:/tmp/^DateRange("yyyy/MM/dd",@startdate,@enddate,"yyyyMMdd")

The base views created over this data source will read all the files from every directory in the specified date range.