Delimited File Sources¶
Use the delimited file data source (DF) to extract data from plain text files like these:
Comma-separated values files (CSV files). With this data source, the delimiter can be a comma or any other character.
Files that have a more complex structure that can be represented with a regular expression (e.g. application log files).
Fixed-width files. That is, files in which the columns do not have delimiters and instead, each column has a fixed width.
To create a new DF (delimited file) data source, right-click the Server Explorer and click New > Data source > DF.
The Tool will display the dialog to create the data source.
The following data are requested:
Name. Name of the new data source.
Data route. Path to the delimited file. The formats of the available paths are described in detail in the section Path Types in Virtual DataPort. The path can be parameterized using interpolation variables (see section Paths and Other Values with Interpolation Variables).
If the selected “Data route” is Local, FTP / SFTP / FTPS Client, HDFS, S3 or ADLS and the route points to a directory, Virtual DataPort infers the schema of the new base view from the first file in the directory and it assumes that all the other files have the same schema.
Note
The changes in the schema will not take effect until a source refresh is performed on the base view.
Note
If in the configuration of the route, the filter is Decompress (zip format) and the .zip file has several files, the base view created over it will retrieve data from all the files in the compressed file. Thus, all the files inside the .zip file must have the same schema.
When the route points to a directory but you only want to process some of the files in this directory, enter a regular expression that matches the names of these files in the File name pattern box. For example, if you want the base view created over a data source to return the data of all the files with the extension log located in the folder C:\log_files, set the “Local path” to C:/log_files and the “File name pattern” to (.*)\.log. Note that “File name pattern” is a regular expression, not a file pattern. That is why “.” is prefixed with “\” (in a regular expression, the dot matches any character unless it is escaped with a backslash).

Note

To retrieve data from files that use a date-based naming convention (as is usual in log files), use the ^DateRange interpolation function. See the section Paths Using Date Ranges to learn how to use this function.

Ignore route errors. If selected, the data source will ignore the errors that occur when accessing the file(s) to which the data source points.
The main goal of this option is to ignore files that cannot be found when the data source points to a collection of files and you know some of them may be missing. For example, you can create a DF data source to read a set of log files with this local path:
/opt/denodo/denodo/logs/apache-tomcat/http_access_^DateRange("yyyy/MM/dd",@start_date,@end_date,"yyyy-MM-dd").log
See more about “DateRange” in the section Paths Using Date Ranges.
When you query a base view created over this data source, the data source will read all the log files in order. For example, if the query includes the condition start_date='2018/05/01' AND end_date='2018/05/04', the data source will try to read the files “http_access_2018-05-01.log”, “http_access_2018-05-02.log”, “http_access_2018-05-03.log” and “http_access_2018-05-04.log”. If one of these files is missing, the query will fail.

To ignore this error, select the check box Ignore route errors. With this option, if one of the files does not exist, the data source will skip it and read the next one. If you run the query from the administration tool, you can identify which files could not be read in the Execution trace: in the trace, click the nodes with Type = Route. The ones that could not be read will have the attribute Exception followed by an error message.
To Process CSV Files
To process a comma-separated values file (a CSV file), select Use column delimiter and in the box Column delimiter, enter the character that separates the values. Note that the delimiter can be a comma or another character. If you enter two or more characters, the behavior depends on the check box Delimiter consists of multiple characters:
If it is selected and the delimiter has two or more characters, the values have to be separated by this exact sequence of characters. For example, if you enter -|-, the file has to have this structure:

value 1-|-value 2-|-value 3-|-value 4
If it is cleared and the delimiter has two or more characters, each of these characters will be considered a delimiter on its own. For example, if you enter ,|, every time the data source finds a comma (,) or a vertical bar (|), it will consider it the beginning of a new field.
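For instance, with this check box cleared and the delimiter ,|, the following line is split into three fields (“value 1”, “value 2” and “value 3”):

value 1,value 2|value 3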
You have to enter the “invisible” characters in a special way. See the table below:
Character | Meaning
--- | ---
\t | Tab
\n | Newline
\r | Carriage return
\f | Form feed
Note
The text qualifier is the double quote (").
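For example, assuming a comma delimiter, the second value of the following line contains a comma but, because it is enclosed in double quotes, it is read as a single field:

1,"Smith, John",Boston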
To Process Files With A Complex Structure (Regular Expression)
To process a file with a complex structure that can be represented with a regular expression, select Use tuple pattern and enter a regular expression in Tuple pattern. This is useful for processing log files because they usually have a fixed structure.
This regular expression specifies the format of one row of data within the file. This expression has to match the whole line of the file (or multiple lines), not only the part that you want to capture. The fields of the views created over this data source will be the capturing groups of the regular expression. For example:
(\d\d\d\d)-(\d\d)-(\d\d) - (.*)
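For instance, this pattern matches a line like the following and produces four fields (year, month, day and the rest of the line):

2014-11-19 - Server started successfully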
The syntax of these regular expressions is the one defined by the Java language (the documentation of the Java class Pattern lists the constructs of these expressions).
The section Examples of How to Define a Tuple Pattern below has several examples of data sources with Tuple Pattern.
To Process Fixed-Width Files
To process a file in which the values have a fixed width, select Use fixed length and provide the following values (a short example follows the list of options below):
Column widths (bytes). Comma-separated list of integers. Each integer is the column size in bytes. E.g. if the file has seven columns, you have to enter seven integers separated by commas.
Note this is the width in bytes, not characters. Take this into account when processing files that have multi-byte characters. For example, “例” is a single character but it occupies more than one byte (three bytes in UTF-8).
Pad character. In fixed-width files, a value is padded (usually with spaces) if it does not use all the bytes allotted to it. If in this file the values are padded with a character other than the space, select Use custom value and enter this character. If you enter two or more characters, all of them will be considered pad characters.
Replace character. The character that the data source returns for values that cannot be represented in the specified charset. For most files, you can leave Use default value (whitespace) selected. If you enter a custom value, enter a single character.
This option rarely applies, except when the file has an unexpected encoding. Say the data source has Column widths = “1,1” (two columns of one byte each), you selected the encoding “UTF-8”, and when the data source reads the file, the first character of a value is multi-byte. In this case, the first byte of the multi-byte character may not correspond to any character; if this happens, the data source replaces it with a space or with the character you entered.
Alignment. The alignment of the values (left or right) when the values have padding.
End of line delimiter. Character string used to mark the end of a tuple. The default value is \n (newline).

Note

To indicate the end of line, use \n, not \r nor \r\n, regardless of the operating system in which Virtual DataPort runs.

Note

For fixed-width files, if the file does not contain end-of-line characters, leave this box empty.
Start of data zone delimiter. Java regular expression identifying the position in the file where the data source has to start retrieving data (or obtaining the header if the Header option is selected). If empty, the search will start at the beginning of the file.
Start delimiter from variable. If selected, the “Start of data zone delimiter” will be considered the name of an interpolation variable that, at runtime, will contain the actual delimiter.
Include start delimiter as data. If selected, the text matching the “Start of data zone delimiter” expression will be processed as part of the data.
End of data zone delimiter. The data source will stop retrieving data from the file when it finds this string. If empty, the data source will continue retrieving data until the end of the file.
Include end delimiter as data. If selected, the text matching the “End of data zone delimiter” expression will be processed as part of the data.
Header. If selected, the data source considers that the first line of the data region contains the names of the fields in this file. These names will be the fields’ names of the base views created from this data source.
Header pattern. Java regular expression used to extract the name of the fields that form the header. This only needs to be specified if the header has a different structure than the data. This option can only be used when the “Header” option is selected.
Ignore matching errors. If selected, the data source will ignore the lines of this data file that do not have the expected structure. I.e. rows that do not have the expected number of columns or, if you are providing a tuple pattern, rows that do not match the pattern.
If you clear this check box, the data source will return an error if there is a row that does not have the expected structure. When you select this check box, you can check whether the data source ignored any rows in a query: execute the query from the Administration Tool, click “View execution trace” and click the “Route” node. You will see the attribute “Number of invalid tuples”.
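As an example of the fixed-width settings described above, consider this hypothetical file with three space-padded, left-aligned columns (ASCII only, so each character occupies one byte):

JOHN      SMITH     BOSTON
MARY      JONES     NEW YORK

To read it, select Use fixed length, set “Column widths (bytes)” to 10,10,8, leave the pad character set to the default (space) and set “Alignment” to left.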
In the Metadata tab, you can:
Set the folder where the data source will be stored.
Provide a description.
Set the infrastructure information where the data is stored. You can set the name of the infrastructure provider (e.g. on-premise, Amazon Web Services, Microsoft Azure, Google Cloud Platform, etc.) and the region name. You can type a custom name if the provider name or region is not on the list.
When editing the data source, you can also change its owner by clicking the corresponding button.
Click Save to create the data source.
Then, click Create base view to create a base view associated with the new data source. If the path to the data file includes interpolation variables, you will have to provide a value for them (the section Paths and Other Values with Interpolation Variables explains how to create paths to files with variables).
The Tool will display the schema that the base view will have. At this point, you can change the name of the view and the name and type of its attributes. In the Metadata tab, click Browse to select the folder where the new base view will be stored. Then, click Save.
In the Server Explorer, double-click the new base view to display its schema. Click Edit to open the editing wizard of the view. In this wizard, you can change the name of the view and the name and type of its attributes.
Examples of How to Define a Tuple Pattern¶
This section contains two examples of delimited-file data sources that are defined with a Tuple Pattern. Although in these examples we use a static value, the value of Tuple Pattern can be an interpolation variable (see the section Paths and Other Values with Interpolation Variables for more information about interpolation variables).
Example 1 of tuple pattern: Let us say that we have a file that contains product information in the following format (note that the discount attribute is optional):
product_name=Acme Laptop Computer;price=1500 euro;discount=50
product_name=Acme Desktop Computer;price=1000 dollar
The following pattern can be used to extract the following information about each product from each row:

Its name

Price and currency

Discount. For the tuples without a discount value, the value of this field will be NULL.
product_name=(.+);price=([0-9]+)\s([^;]+)(?:;discount=(.+))?
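Applied to the two sample rows above, this pattern yields ('Acme Laptop Computer', '1500', 'euro', '50') for the first row and ('Acme Desktop Computer', '1000', 'dollar', NULL) for the second.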
Example 2 of tuple pattern: Let us say that we want to extract the names of the files (not directories), their dates and their sizes from the output of the Windows command DIR:
11/07/2007 10:10 <DIR> .dbvis
09/18/2008 15:09 <DIR> .eclipse
01/19/2011 16:55 <DIR> .gimp-2.6
11/10/2009 18:43 215 .ITLMRegistry
03/26/2010 14:16 3.498 .keystore
05/18/2010 17:56 <DIR> .m2
02/02/2010 15:23 <DIR> .maven
03/26/2010 14:01 <DIR> .netbeans
02/02/2011 19:20 <DIR> .smc_cache
06/15/2010 09:59 <DIR> .ssh
10/14/2009 13:26 <DIR> .thumbnails
02/15/2010 12:06 0 .Xauthority
01/16/2008 12:02 517 ant.install.log
02/11/2010 13:29 <DIR> Application Data
07/16/2010 08:51 772 build.properties
02/18/2008 15:19 <DIR> Contacts
01/14/2011 10:02 190 default-soapui-workspace.xml
02/07/2011 11:44 <DIR> Desktop
04/01/2009 15:11 <DIR> Favourites
08/22/2008 12:50 <DIR> Start Menu
01/27/2011 17:18 <DIR> My Documents
02/12/2009 12:36 201 osdadmin.ini
01/14/2011 10:02 7.958 soapui-settings.xml
02/09/2010 10:02 22.358 temp.txt
02/10/2011 09:22 <DIR> Tracing
03/05/2010 09:41 0 vdpws.log
04/17/2009 09:49 <DIR> workspace
The “Tuple pattern” has to be:
(\d{2}/\d{2}/\d{4})\s+\d{2}:\d{2}.*?(\d[^ ]*)\s(.*)
The base views created with this tuple pattern will have three fields: the date of the file, its size and its name.
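For instance, from the line “11/10/2009 18:43 215 .ITLMRegistry” in the sample output above, the pattern produces the tuple ('11/10/2009', '215', '.ITLMRegistry').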
Paths Using Date Ranges¶
When you need to access files that use a date-based naming convention (as is typical in log files), use the ^DateRange interpolation function to consider only the files between a given start date and a given end date.
The syntax of the ^DateRange function is the following:

^DateRange( <pattern of the date range : text>, <start date : text>, <end date : text>, <pattern of the files : text> )
Pattern of the date range: pattern of the parameters start date and end date. This pattern follows the syntax specified by the class SimpleDateFormat of the Java API. For example, yyyy-MM-dd is the pattern for <year (4 digits)>-<month in year (2 digits)>-<day in month (2 digits)>. E.g., “2014-01-31”. This parameter can be a literal or an interpolation variable.

Start date: initial date of the date range. This value has to follow the pattern specified in the first parameter of the function. It can be a literal or an interpolation variable.

End date: final date of the date range. This value has to follow the pattern specified in the first parameter of the function. It can be a literal or an interpolation variable.

Pattern of the file names: pattern with which the dates are written in the names of the files. This parameter can be a literal or an interpolation variable.
Important
When using this function, follow these two rules:

The literal parameters of the ^DateRange function have to be surrounded by double quotes. With this function, you cannot use single quotes.

Do not leave any space between the parameters of the ^DateRange function.
Example 1
Let us say that the C:/logs directory contains the log files generated daily by an application. The name of these files follows the pattern application_log_<year>-<month>-<day>.log (e.g. application_log_2014-11-19.log).
To create a data source that reads all the logs of January 2014, set the path of the data source to the following:
C:/logs/application_log_^DateRange("yyyy/MM/dd","2014/01/01","2014/01/31","yyyy-MM-dd").log
Note that the value of the first parameter (yyyy/MM/dd) is the pattern followed by the start date and end date parameters of the function.
A base view created over this data source will return the data “in order”. That is, the data source reads the first file of the date range (application_log_2014-01-01.log), then the file of the next day (application_log_2014-01-02.log), then the file of the following day (application_log_2014-01-03.log), and so on.
At runtime, if one of the files of the date range is missing and the “Ignore route errors” check box is not selected, the query will return an error but it will also return the contents of all the files it found. For example, if the file application_log_2014-01-15.log does not exist, a query to this data source will return an error message that explains the situation, but the result will contain the data read from all the other files (application_log_2014-01-14.log, application_log_2014-01-16.log, etc.).
Example 2
In the previous example, the date range of the files was fixed and could not be changed at runtime. If you want this range to be dynamic, set the start date and end date parameters to be interpolation variables. E.g.

C:/logs/application_log_^DateRange("MM/dd/yyyy",@start_date,@end_date,"yyyy-MM-dd").log
The base views created over this data source will have two extra fields called start_date and end_date whose values set the date range. For instance, the following query
SELECT *
FROM bv_application_log
WHERE start_date = '01/01/2014' AND end_date = '03/31/2014'
will make the data source process the log files of the first quarter of 2014.
Note that when a parameter is an interpolation variable, you do not have to add double quotes.
Example 3
^DateRange can also be used with paths that point to directories instead of a group of files.

For instance, if the logs of each day are stored in a separate directory with the naming convention yyyyMMdd, set the path of the data source to this:
C:/tmp/^DateRange("yyyy/MM/dd",@startdate,@enddate,"yyyyMMdd")
The base views created over this data source will read all the files from every directory in the specified date range.
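For example, if @startdate is '2014/01/01' and @enddate is '2014/01/03', the data source will read the files stored in the directories C:/tmp/20140101, C:/tmp/20140102 and C:/tmp/20140103.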