Delimited File Sources¶
This type of source retrieves data from text files in CSV (Comma Separated Values) format or similar. Complex regular expressions can also be used to extract the desired data from other kinds of text files (e.g. application log files).
To create a new DF (delimited file) data source, right-click on the Server Explorer and click New > Data source > DF
The Tool will display the dialog to create the data source.
The following data are requested:
Name. Name of the new data source.
Data route. Path to the delimited file. The formats of the available paths are described in detail in the section Path Types in Virtual DataPort. The path can be parameterized using interpolation variables (see section Paths and Other Values with Interpolation Variables).
If the selected “Data route” is Local or FTP / SFTP / FTPS Client and the route points to a directory, the base views created over this data source will retrieve the data from all the files in the directory and not just one file. The data in all the files must have the same schema.
loglocated in the folder
C:\log_files, set the “Local path” to
C:\log_filesand the “File name pattern” to
(.*)\.log. Note that “File name pattern” is a regular expression and not a file pattern.
When you need to retrieve data from files that use a date-based naming convention (as is usual in log files), use the
^DateRangeinterpolation function. See the section Paths Using Date Ranges to learn to use this function.
Ignore route errors. If selected, the Server will ignore the errors occurred when accessing the file(s) to which the data source points.
This option is not meant to be used when the data source reads a single file. Its main purpose is when the data source points to a collection of files and you know some of them may be missing. For example, you can create a DF data source to read a set of log files with this local path:
/opt/denodo/denodo/logs/apache-tomcat/http_access.^DateRange("yyyy/MM/dd",@start_date,@end_date,"yyyy-MM-dd").log(see more about “DateRange” in the section Paths Using Date Ranges).
When you query a base view created over this data source, the data source will read all the log files in order. For example, if in the query you put the condition
start_date='2018/05/01' AND end_date = '2018/05/04', the data source will try to read the files “http_access_2018-05-01.log”, “http_access_2018-05-02.log”, “http_access_2018-05-03.log” and “http_access_2018-05-04.log”. If one these files is missing, the query will fail.
If you want to ignore this error, select the check box Ignore route errors. With this option if one of the files does not exist, the data source will skip it and read the next one. If you run the query from the administration tool, you can identify which files could not be read in the Execution trace: in the trace, click on the nodes with Type = Route. The ones that could not be read will have the attribute Exception followed by an error message.
Column delimiter. Character that separates the values of a row.
If you enter more than one character, all these characters will be considered a delimiter unless you select the check box Delimiter consists of multiple characters. For example, if you enter
,|, every time the data source finds the character comma (,) or the vertical bar (|), it will consider the beginning of a new field.
Some “invisible” characters have to be entered in a special way. See the table below:
The text qualifier is the double quote (“).
Tuple pattern. Regular expression that specifies the format of the tuples that will be extracted from the delimited file. This regular expression has to match the whole line of the file, not only the part that you want to capture. The fields of the views created over this data source will be the capturing groups of the regular expression.
The syntax of these regular expressions is the one defined by the Java language (the documentation of the Java class Pattern lists the constructs of these expressions).
The section Examples of How to Define a Tuple Pattern has several examples of data sources with Tuple Pattern.
Delimiter consists of multiple characters. If selected, the character delimiter can be multiple characters. For example,
~|would now be a valid delimiter. If cleared, enter only one character in Column delimiter or all of them will be considered a delimiter.
End of line delimiter. Character string to be used to mark the end of a tuple. The default value is
\rcannot be used as an end of line delimiter. You always have to indicate the end of line delimiter with
\nregardless of the operating system where Virtual DataPort is running.
Start of data zone delimiter. Java regular expression identifying the position in the file where the Server has to start retrieving data (or obtaining the header if the Header option is selected). If empty, the search will start at the beginning of the file.
Start delimiter from variable. If selected, the “Start of data zone delimiter” will be considered the name of an interpolation variable that, at runtime, will contain the “Data zone delimiter”.
Include start delimiter as data. If selected, the text matching the “Start of data zone delimiter” expression will be included in the search space.
End of data zone delimiter. The Server stops retrieving data from the file when it finds this string. If empty, the Server will continue retrieving data until the end of the file.
Include end delimiter as data. If selected, the text matching the “End of data zone delimiter” expression will be considered in the results.
Header. If selected, the Server considers that the first line of the data region contains the names of the fields in this file. These names will be the fields’ names of the base views created from this data source.
Header pattern. Java regular expression used to extract the name of the fields that form the header. This only needs to be specified if the header has a different structure than the data. This option can only be used when the “Header” option is selected.
Ignore matching errors. If selected, when a query involves this data source, the Server will ignore the lines of this data file that do not have the expected structure. I.e. rows that do not have the expected number of columns or, if you are providing a tuple pattern, rows that do not match the pattern.
If you clear this check box, the Server will return an error if there is a row that does not have the expected structure. When you select this check box, you can check if the Server has ignored any row in a query. To do this, execute the query from the Administration Tool. Then, click “View execution trace” and click the “Route” node. You will see the attribute “Number of invalid tuples”.
In the Metadata tab, you can set the folder where the data source will be stored and provide a description.
When editing the data source, you can also change its owner by clicking the button .
Click Save to create the data source.
Then, click Create base view to create a base view associated with the new data source. If the path to the data file includes interpolation variables, you will have to provide a value for them (the section Paths and Other Values with Interpolation Variables explains how to create paths to files with variables).
The Tool will display the schema that the base view will have. At this point, you can change the name of the view and the name and type of its attributes. In the Metadata tab, click Browse to select the folder where the new base view will be stored. Then, click Save ().
In the Server Explorer, double-click the new base view to display its schema. Click Edit to open the edition wizard of the view. In this wizard, you can change the name and type of the base view.
Examples of How to Define a Tuple Pattern¶
This section contains two examples of delimited-file data sources that are defined with a Tuple Pattern. Although in these examples we use a static value, the value of Tuple Pattern can be an interpolation variable (see the section Paths and Other Values with Interpolation Variables for more information about interpolation variables).
Example 1 of tuple pattern: Let us say that we have a file that
contains product information in the following format (note that the
discount attribute is optional):
product_name=Acme Laptop Computer;price=1500 euro;discount=50 product_name=Acme Desktop Computer;price=1000 dollar
The following pattern can be used to extract from each row the following information about each product:
Price and currency
Discount. For the tuples without a discount value, the value of this cell will be
Example 2 of tuple pattern: Let us say that we want to extract the
name of the files (not directories), its date and its size, from the
output of the Windows command
11/07/2007 10:10 <DIR> .dbvis 09/18/2008 15:09 <DIR> .eclipse 01/19/2011 16:55 <DIR> .gimp-2.6 11/10/2009 18:43 215 .ITLMRegistry 03/26/2010 14:16 3.498 .keystore 05/18/2010 17:56 <DIR> .m2 02/02/2010 15:23 <DIR> .maven 03/26/2010 14:01 <DIR> .netbeans 02/02/2011 19:20 <DIR> .smc\_cache 06/15/2010 09:59 <DIR> .ssh 10/14/2009 13:26 <DIR> .thumbnails 02/15/2010 12:06 0 .Xauthority 01/16/2008 12:02 517 ant.install.log 02/11/2010 13:29 <DIR> Application Data 07/16/2010 08:51 772 build.properties 02/18/2008 15:19 <DIR> Contacts 01/14/2011 10:02 190 default-soapui-workspace.xml 09/16/2010 16:35 78.170.461 denodo-v46-update-201009161734.jar 02/07/2011 11:44 <DIR> Desktop 04/01/2009 15:11 <DIR> Favourites 08/22/2008 12:50 <DIR> Start Menu 01/27/2011 17:18 <DIR> My Documents 02/12/2009 12:36 201 osdadmin.ini 02/12/2009 12:35 201 osdadmin.ini.bak 01/14/2011 10:02 7.958 soapui-settings.xml 02/09/2010 10:02 22.358 temp.txt 04/19/2010 18:55 583 test.cert 02/10/2011 09:22 <DIR> Tracing 03/05/2010 09:41 0 vdpws.log 04/17/2009 09:49 <DIR> workspace
The “Tuple pattern” has to be:
The base views created with this tuple pattern will have three fields: the date of the file, its size and its name.
Paths Using Date Ranges¶
When you need access to files that use a date-based naming convention
(as is typical in log files), use the
function to consider only the files between a given start date and a
given end date.
The syntax of the
^DateRange function is the following:
^DateRange ( pattern of the date range : text , start date : text , end date : text , pattern of the files : text )
Pattern of the date range: pattern of the parameters
end date. This pattern follows the syntax specified by the class SimpleDateFormat of the Java API.
yyyy-MM-ddis the pattern for <year (4 digits)>-<month in year (2 digits)>-<day in month (2 digits)>. E.g., “2014-01-31”.
This parameter can be a literal or an interpolation variable.
Start date: initial date of the date range. This value has to follow the pattern specified in the first parameter of the function. It can be a literal or an interpolated variable.
End date: finish date of the date range. This value has to follow the pattern specified in the first parameter of the function. It can be a literal or an interpolated variable.
Pattern of the file names: pattern of the file names. This parameter can be a literal or an interpolated variable.
When using this function, follow these two rules:
The literal parameters of the
DateRangefunction have to be surrounded by double quotes. With this function you cannot use single quotes.
You cannot leave any space between the parameters of the
Let us say that the
C:/logs directory contains the log files
generated daily by an application. The name of these files follows the
To create a data source that reads all the logs of January 2014, set the path of the data source to the following:
Note that the value of the first parameter (
yyyy/MM/dd) is the
pattern followed by the start date and end date parameters of the
A base view created over this data source will return the data “in
order”. That is, the data source reads the file of the data range, which
is the first of January (
reads the file of the next day (
then the file of the next day (
At runtime, if one of the files of the date range is missing and “Ignore route
errors” check box is not selected, the query will return an error but it will
also return the contents of all the files it found. For example, if the file
application_log_2014-01-15.log does not exist, a query to this data source
will return an error message that explains the situation, but the result will
contain the data read from all the other files. I.e.
In the previous example, the date range of the files was fixed and it
could not be changed at runtime. If you want this range to be dynamic,
you can set the
start date and
end date parameters to be
interpolation variables. E.g.
The base views created over this data source will have two extra fields
end_date whose value will set the date
range. For instance, the following query
SELECT * FROM bv_application_log WHERE start_date = '01/01/2014' AND end_date = '03/31/2014'
will make the data source to process the log files of the first quarter of 2014.
Note that when a parameter is an interpolation variable, you do not have to add double quotes.
^DateRange can also be used with paths that point to directories
instead of a group of files.
For instance, if the logs of each day are stored in a separate directory
with the naming convention
yyyyMMdd, set the path of the data source
The base views created over this data source will read all the files from every directory in the specified date range.