Delimited File Sources¶
Use the delimited file data source (DF) to extract data from plain text files like these:
Comma-separated values files (CSV files). With this data source, the delimiter can be a comma or any other character.
Files that have a more complex structure that can be represented with a regular expression (e.g. application log files).
Fixed-width files. That is, files in which the columns do not have delimiters and instead, each column has a fixed width.
To create a new DF (delimited file) data source, right-click on the Server Explorer and click New > Data source > DF
The Tool will display the dialog to create the data source.
The following data are requested:
Name. Name of the new data source.
Data route. Path to the delimited file. The formats of the available paths are described in detail in the section Path Types in Virtual DataPort. The path can be parameterized using interpolation variables (see section Paths and Other Values with Interpolation Variables).
If the selected “Data route” is Local, FTP / SFTP / FTPS Client, HDFS or S3 and the route points to a directory, the base views created over this data source will retrieve the data from all the files in the directory and not just one file. Thus, a base view schema will be created including all different fields in all the files at that time.
The changes in the schema will not take effect until a source refresh is performed onto the base view.
If in the configuration of the route, the filter is Decompress (zip format) and the .zip file has several files, the base view created over it will retrieve data from all the files in the compressed file. Thus, all the files inside the .zip file must have the same schema.
When the route points to a directory but you only want to process some of the files in this directory, enter a regular expression that matches the names of these files, in the File name pattern box. For example, if you want the base view created over a data source to return the data of all the files with the extension
loglocated in the folder
C:\log_files, set the “Local path” to
C:/log_filesand the “File name pattern” to
(.*)\.log. Note that “File name pattern” is a regular expression and not a file pattern. That is why “.” is prefixed with “" (in a regular expression, the dot matches any character unless is prefixed by a slash).
To retrieve data from files that use a date-based naming convention (as is usual in log files), use the
^DateRangeinterpolation function. See the section Paths Using Date Ranges to learn to use this function.
Ignore route errors. If selected, the data source will ignore the errors occurred when accessing the file(s) to which the data source points.
This main goal of this option is to ignore files that cannot be found, when the data source points to a collection of files and you know some of them may be missing. For example, you can create a DF data source to read a set of log files with this local path:
See more about “DateRange” in the section Paths Using Date Ranges.
When you query a base view created over this data source, the data source will read all the log files in order. For example, if in the query you put the condition
start_date='2018/05/01' AND end_date = '2018/05/04', the data source will try to read the files “http_access_2018-05-01.log”, “http_access_2018-05-02.log”, “http_access_2018-05-03.log” and “http_access_2018-05-04.log”. If one these files is missing, the query will fail.
If you want to ignore this error, select the check box Ignore route errors. With this option if one of the files does not exist, the data source will skip it and read the next one. If you run the query from the administration tool, you can identify which files could not be read in the Execution trace: in the trace, click on the nodes with Type = Route. The ones that could not be read will have the attribute Exception followed by an error message.
To Process CSV Files
To process a comma-separated values file (a CSV file), select Use column delimiter and in the box Column delimiter, enter the character that separates the values. Note that the delimiter can be a comma or another character. If you enter two or more characters, the behavior depends on the check box Delimiter consists of multiple characters:
If it is selected and the delimiter has two or more characters, the values have to be separated by these characters. For example, if you enter
-|-, the file has to have this structure:
value 1-|-value 2-|-value 3-|-value4
If it is cleared and the delimiter has two or more characters, all these characters will be considered a delimiter. For example, if you enter
,|, every time the data source finds the character comma (,) or the vertical bar (|), it will consider the beginning of a new field.
You have to enter the “invisible” characters in a special way. See the table below:
The text qualifier is the double quote (“).
To Process Files With A Complex Structure (Regular Expression)
To process a file with a complex structure that can be represented with a regular expression, select Use tuple pattern and enter a regular expression in Tuple pattern. This is useful to process log files because usually, they have a fixed structure.
This regular expression specifies the format of one row of data within the file. This expression has to match the whole line of the file (or multiple lines), not only the part that you want to capture. The fields of the views created over this data source will be the capturing groups of the regular expression. For example:
(\d\d\d\d)-(\d\d)-(\d\d) - (.*)
The syntax of these regular expressions is the one defined by the Java language (the documentation of the Java class Pattern lists the constructs of these expressions).
The section Examples of How to Define a Tuple Pattern below has several examples of data sources with Tuple Pattern.
To Process Fixed-Width Files
To process a file in which the values have a fixed-width, select Use fixed length and provide the following values:
Column widths (bytes). Comma-separated list of integers. Each integer is the column size in bytes. E.g. if the file has seven columns, you have to enter 7 integers separated by a comma.
Note this is the width in bytes, not characters. Take this into account when processing files that have multi-byte characters. For example, “例” only has one character but it is made of two bytes.
Pad character. In fixed-width files, a value is padded - usually with spaces - if it does not use all the bytes allotted to it (there is extra spaces). If in this file, the values are padded with a character that is not space, select Use custom value and enter this character. If you enter more two or more characters, all of them will be considered pad characters.
Replace character. The character that the data source returns for values that cannot be represented in the specified charset. For most files, you can leave Use default value (whitespace) selected. If you enter a custom value, enter a single character.
This option rarely applies except in situations where the file has an unexpected encoding. Let us say the data source has column width = “1,1” (two columns of one byte each), you selected the encoding “UTF-8”, and when the data source reads the file the first character of a value is multi-byte. In this case, the first byte of a multi-byte character may not correspond to any character and if this happens, the data source will replace this with a space or the character you entered.
Alignment. The alignment of the values (left or right) when the values have padding.
End of line delimiter. Character string to be used to mark the end of a tuple. The default value is
To indicate the end of line, use
\r\n, regardless of the operating system in which Virtual DataPort runs.
For fixed-width files, you can leave this parameter empty. If there are no end of lines in this file, leave this box empty.
Start of data zone delimiter. Java regular expression identifying the position in the file where the data source has to start retrieving data (or obtaining the header if the Header option is selected). If empty, the search will start at the beginning of the file.
Start delimiter from variable. If selected, the “Start of data zone delimiter” will be considered the name of an interpolation variable that, at runtime, will contain the “Data zone delimiter”.
Include start delimiter as data. If selected, the text matching the “Start of data zone delimiter” expression will be included in the search space.
End of data zone delimiter. The data source will stop retrieving data from the file when it finds this string. If empty, the data source will continue retrieving data until the end of the file.
Include end delimiter as data. If selected, the text matching the “End of data zone delimiter” expression will be considered in the results.
Header. If selected, the data source considers that the first line of the data region contains the names of the fields in this file. These names will be the fields’ names of the base views created from this data source.
Header pattern. Java regular expression used to extract the name of the fields that form the header. This only needs to be specified if the header has a different structure than the data. This option can only be used when the “Header” option is selected.
Ignore matching errors. If selected, the data source will ignore the lines of this data file that do not have the expected structure. I.e. rows that do not have the expected number of columns or, if you are providing a tuple pattern, rows that do not match the pattern.
If you clear this check box, the data source will return an error if there is a row that does not have the expected structure. When you select this check box, you can check if the data source has ignored any row in a query. To do this, execute the query from the Administration Tool. Then, click “View execution trace” and click the “Route” node. You will see the attribute “Number of invalid tuples”.
In the Metadata tab, you can set the folder where the data source will be stored and provide a description.
When editing the data source, you can also change its owner by clicking the button .
Click Save to create the data source.
Then, click Create base view to create a base view associated with the new data source. If the path to the data file includes interpolation variables, you will have to provide a value for them (the section Paths and Other Values with Interpolation Variables explains how to create paths to files with variables).
The Tool will display the schema that the base view will have. At this point, you can change the name of the view and the name and type of its attributes. In the Metadata tab, click Browse to select the folder where the new base view will be stored. Then, click Save ().
In the Server Explorer, double-click the new base view to display its schema. Click Edit to open the edition wizard of the view. In this wizard, you can change the name and type of the base view.
Examples of How to Define a Tuple Pattern¶
This section contains two examples of delimited-file data sources that are defined with a Tuple Pattern. Although in these examples we use a static value, the value of Tuple Pattern can be an interpolation variable (see the section Paths and Other Values with Interpolation Variables for more information about interpolation variables).
Example 1 of tuple pattern: Let us say that we have a file that
contains product information in the following format (note that the
discount attribute is optional):
product_name=Acme Laptop Computer;price=1500 euro;discount=50 product_name=Acme Desktop Computer;price=1000 dollar
The following pattern can be used to extract from each row the following information about each product:
Price and currency
Discount. For the tuples without a discount value, the value of this cell will be
Example 2 of tuple pattern: Let us say that we want to extract the
name of the files (not directories), its date and its size, from the
output of the Windows command
11/07/2007 10:10 <DIR> .dbvis 09/18/2008 15:09 <DIR> .eclipse 01/19/2011 16:55 <DIR> .gimp-2.6 11/10/2009 18:43 215 .ITLMRegistry 03/26/2010 14:16 3.498 .keystore 05/18/2010 17:56 <DIR> .m2 02/02/2010 15:23 <DIR> .maven 03/26/2010 14:01 <DIR> .netbeans 02/02/2011 19:20 <DIR> .smc_cache 06/15/2010 09:59 <DIR> .ssh 10/14/2009 13:26 <DIR> .thumbnails 02/15/2010 12:06 0 .Xauthority 01/16/2008 12:02 517 ant.install.log 02/11/2010 13:29 <DIR> Application Data 07/16/2010 08:51 772 build.properties 02/18/2008 15:19 <DIR> Contacts 01/14/2011 10:02 190 default-soapui-workspace.xml 02/07/2011 11:44 <DIR> Desktop 04/01/2009 15:11 <DIR> Favourites 08/22/2008 12:50 <DIR> Start Menu 01/27/2011 17:18 <DIR> My Documents 02/12/2009 12:36 201 osdadmin.ini 01/14/2011 10:02 7.958 soapui-settings.xml 02/09/2010 10:02 22.358 temp.txt 02/10/2011 09:22 <DIR> Tracing 03/05/2010 09:41 0 vdpws.log 04/17/2009 09:49 <DIR> workspace
The “Tuple pattern” has to be:
The base views created with this tuple pattern will have three fields: the date of the file, its size and its name.
Paths Using Date Ranges¶
When you need access to files that use a date-based naming convention
(as is typical in log files), use the
function to consider only the files between a given start date and a
given end date.
The syntax of the
^DateRange function is the following:
^DateRange ( pattern of the date range : text , start date : text , end date : text , pattern of the files : text )
Pattern of the date range: pattern of the parameters
end date. This pattern follows the syntax specified by the class SimpleDateFormat of the Java API.
yyyy-MM-ddis the pattern for <year (4 digits)>-<month in year (2 digits)>-<day in month (2 digits)>. E.g., “2014-01-31”.
This parameter can be a literal or an interpolation variable.
Start date: initial date of the date range. This value has to follow the pattern specified in the first parameter of the function. It can be a literal or an interpolated variable.
End date: finish date of the date range. This value has to follow the pattern specified in the first parameter of the function. It can be a literal or an interpolated variable.
Pattern of the file names: pattern of the file names. This parameter can be a literal or an interpolated variable.
When using this function, follow these two rules:
The literal parameters of the
DateRangefunction have to be surrounded by double quotes. With this function you cannot use single quotes.
You cannot leave any space between the parameters of the
Let us say that the
C:/logs directory contains the log files
generated daily by an application. The name of these files follows the
To create a data source that reads all the logs of January 2014, set the path of the data source to the following:
Note that the value of the first parameter (
yyyy/MM/dd) is the
pattern followed by the start date and end date parameters of the
A base view created over this data source will return the data “in
order”. That is, the data source reads the file of the data range, which
is the first of January (
reads the file of the next day (
then the file of the next day (
At runtime, if one of the files of the date range is missing and “Ignore route
errors” check box is not selected, the query will return an error but it will
also return the contents of all the files it found. For example, if the file
application_log_2014-01-15.log does not exist, a query to this data source
will return an error message that explains the situation, but the result will
contain the data read from all the other files. I.e.
In the previous example, the date range of the files was fixed and it
could not be changed at runtime. If you want this range to be dynamic,
you can set the
start date and
end date parameters to be
interpolation variables. E.g.
The base views created over this data source will have two extra fields
end_date whose value will set the date
range. For instance, the following query
SELECT * FROM bv_application_log WHERE start_date = '01/01/2014' AND end_date = '03/31/2014'
will make the data source to process the log files of the first quarter of 2014.
Note that when a parameter is an interpolation variable, you do not have to add double quotes.
^DateRange can also be used with paths that point to directories
instead of a group of files.
For instance, if the logs of each day are stored in a separate directory
with the naming convention
yyyyMMdd, set the path of the data source
The base views created over this data source will read all the files from every directory in the specified date range.