Text Processing Functions
Text processing functions perform a transformation or calculation on a text-type attribute or constant.
CONCAT
: The concatenation function receives a variable number of arguments and returns a string-type element that is the result of concatenating them. The infix version of this function receives two arguments and is represented by the symbol ‘||’.
CONTEXTUALSUMMARY
: This function obtains a contextual summary of a text based on a keyword search. It returns a series of text fragments containing the specified word or sentence. The function has the following signature:
CONTEXTUALSUMMARY(content:string, keyword:string, [beginDelim:string, endDelim:string, fragmentSeparator:string, fragmentLength:int [,maxFragmentsNumber:int]])
where:
content
: text to analyze, from which the most relevant fragments are extracted (mandatory).
keyword
: the keyword used to extract the text fragments (mandatory). This argument can be a single word or a sentence.
beginDelim
: text to add as a prefix of the keyword whenever it appears in the text (optional, default value is “”).
endDelim
: text to add as a suffix of the keyword whenever it appears in the text (optional, default value is “”).
fragmentSeparator
: text to use as a separator between the text fragments obtained as a result (optional, default value is “…”).
fragmentLength
: approximate number of characters that appear before and after each keyword occurrence in the text (optional, default value is 5).
maxFragmentsNumber
: maximum number of fragments to retrieve.
analyzer
: analyzer to use when performing the keyword search. By default, the Standard Analyzer (std) is used; this analyzer does not consider lemmatization or stopwords. Analyzers for English (en) and Spanish (es) are also included.
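To make the role of each argument concrete, the following Python sketch imitates the behaviour described above. It is an illustration under assumptions, not ITPilot's implementation: the function name `contextual_summary`, the literal case-insensitive matching of the keyword, and the handling of nearby occurrences are all hypothetical, and the `analyzer` argument is not modelled.

```python
import re

def contextual_summary(content, keyword, begin_delim="", end_delim="",
                       fragment_separator="…", fragment_length=5,
                       max_fragments_number=None):
    """Hypothetical sketch of the documented behaviour (no analyzer support)."""
    fragments = []
    # Assumption: the keyword is matched literally and case-insensitively.
    for match in re.finditer(re.escape(keyword), content, flags=re.IGNORECASE):
        if max_fragments_number is not None and len(fragments) >= max_fragments_number:
            break
        # Keep roughly fragment_length characters on each side of the keyword.
        start = max(0, match.start() - fragment_length)
        end = min(len(content), match.end() + fragment_length)
        fragments.append(
            content[start:match.start()]
            + begin_delim + match.group(0) + end_delim
            + content[match.end():end]
        )
    return fragment_separator.join(fragments)

text = "The quick brown fox jumps over the lazy dog. The fox sleeps."
summary = contextual_summary(text, "fox", "<b>", "</b>", " ... ", 6, 2)
print(summary)
```

Each fragment is the keyword wrapped in the two delimiters plus a window of surrounding characters, and the fragments are joined with the separator. A real analyzer would also normalise word forms, which this sketch does not attempt.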
GETBYTES
: This function receives two string-type arguments and returns the result of transforming the text received in the first argument to a byte array, using the encoding specified in the second argument. If the text to transform (first argument) is null, the function returns null. If the specified encoding (second argument) is null, the JVM default encoding is used.
INDEXOF
: The INDEXOF function receives two string-type parameters and returns the index of the first appearance of the second string inside the first string, or -1 if the second string is not contained in the first. The return type is integer.
LEN
: The LEN function receives a string-type argument and returns the number of characters in it. Alternatively, it accepts a binary-type argument and returns its size in bytes.
LOWER
: This function receives a string-type argument and returns it with all of its characters converted to lower case.
REGEXP
: This function transforms character strings based on regular expressions. It takes three arguments: a string-type element, an input regular expression and an output regular expression. The regular expressions must be written using Java regular expression syntax. The input regular expression is matched against the text of the first argument, and the output regular expression may reference the “groups” defined in the input regular expression; each reference is replaced by the portion of text that matched the corresponding group. For example, the result of evaluating:
REGEXP('Shakespeare, William','(\\w+), (\\w+)','$2 $1')
will be “William Shakespeare”.
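For readers who want to experiment with the group mechanism outside ITPilot, the sketch below reproduces the example in Python. The helper `regexp` is hypothetical and rests on two assumptions: that every match in the input is replaced, and that `$n` is the only group syntax used in the output expression. Python writes group references as `\g<n>`, so the helper translates them, and the pattern is given as a raw string with single backslashes.

```python
import re

def regexp(text, input_regex, output_regex):
    """Hypothetical Python analogue of REGEXP for simple `$n` group references."""
    # Translate Java-style group references ($1, $2, ...) to Python's \g<n> form.
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", output_regex)
    return re.sub(input_regex, python_replacement, text)

result = regexp("Shakespeare, William", r"(\w+), (\w+)", "$2 $1")
print(result)
```

The two parenthesised groups capture the surname and the first name, and the output expression swaps them. Java and Python regular expressions differ in some details, so more complex patterns should be checked against the Java documentation.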
REMOVEACCENTS
: This function receives a string-type argument and returns the same value with its accents removed.
REMOVEWHITESPACES
: This function receives a string-type argument and returns the same value with all blanks removed.
REPLACE
: This function receives three string-type arguments and returns the result of replacing the occurrences of the second argument in the first argument with the third argument.
SIMILARITY
: This function receives two character strings and returns a value between 0 and 1, which is an estimate of the similarity between the strings. An optional third string-type parameter specifies the algorithm used to calculate the similarity measure. ITPilot includes the following algorithms (if no algorithm is specified, ITPilot chooses the one to apply):
1. Based on the edit distance between the text strings: ScaledLevenshtein, JaroWinkler, Jaro, Level2Jaro, MongeElkan, Level2MongeElkan.
2. Based on the appearance of common terms in the texts: TFIDF, Jaccard, UnsmoothedJS.
3. Combinations of both: JaroWinklerTFIDF.
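As a rough illustration of the first two families listed above, the following Python sketch computes one edit-distance measure and one term-based measure. These are textbook formulations, not ITPilot's code: the function names are hypothetical, and the scaling used for the Levenshtein score (one minus the distance divided by the length of the longer string) is an assumption that may differ from what ScaledLevenshtein does internally.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def scaled_levenshtein(a, b):
    """Assumed scaling: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def jaccard(a, b):
    """Jaccard index over whitespace-separated, lower-cased terms."""
    terms_a, terms_b = set(a.lower().split()), set(b.lower().split())
    if not terms_a and not terms_b:
        return 1.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)

print(scaled_levenshtein("kitten", "sitting"))
print(jaccard("the red car", "the blue car"))
```

Edit-distance measures tend to suit short strings with typing variations, while term-based measures suit longer texts that share words in a different order. Both sketches return values between 0 and 1, as the function description requires.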
SPLIT
: The split function takes two string-type arguments. It splits the second argument around matches of the regular expression given as the first argument, and returns an array containing the resulting substrings.
SUBSTRING
: The substring function receives a string-type argument and two integers. It returns the part of the first argument delimited by the positions given in the second (beginning) and third (end) arguments. The result contains all the characters from the beginning index up to, but not including, the character at the end index; that is, the end index marks the first character that is not included in the output.
TRIM
: This function receives a string-type argument and returns it with all spaces and carriage returns removed from the beginning and the end of the string.
UPPER
: This function receives a string-type argument and returns it with all of its characters converted to upper case.
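The index conventions and argument orders of INDEXOF, SUBSTRING, SPLIT and TRIM are easy to misread, so the Python sketch below mirrors them. The helper names are hypothetical, and zero-based positions are an assumption, since the descriptions do not state the base. Only the -1 result of INDEXOF, the end-exclusive behaviour of SUBSTRING, and the regex-first argument order of SPLIT come from the descriptions.

```python
import re

def index_of(text, search):
    """Position of the first occurrence of `search` in `text`, or -1 if absent."""
    return text.find(search)

def substring(text, begin, end):
    """Characters from `begin` up to, but not including, `end`."""
    return text[begin:end]

def split(regex, text):
    """Regular expression first, text second, as described for SPLIT."""
    return re.split(regex, text)

def trim(text):
    """Strip leading and trailing spaces and line breaks.

    Assumption: "carriage returns" covers both \\r and \\n; tabs are left alone.
    """
    return text.strip(" \r\n")

position = index_of("hello world", "world")
print(position, substring("hello world", 0, 5), split(r",\s*", "a, b,c"))
```

Because the end index is exclusive, the length of the result of `substring(text, begin, end)` is `end - begin`, which makes it convenient to combine with a position returned by `index_of`.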