Retrieving data from a single page

The previous wrapper introduced automated browsing through pages. That is the first half of what a web automation wrapper usually does. The other major task in common wrappers is extracting information from a page and giving it structure. In this example we will build a wrapper that browses to the JFK airport website at https://www.airport-jfk.com/arrivals.php and pulls real-time information about flight arrivals at JFK airport.

  1. Create a new wrapper named "jkfarrivalinfo".

  2. Add a Sequence component that navigates to https://www.airport-jfk.com/arrivals.php.


  3. The component used to retrieve information from a web page is called Extractor. Drag an Extractor component onto the workspace and link it after the Sequence component.

  4. The Extractor component will extract data from the page that we navigated to, so set the output of the Sequence component (Sequence_1_output) as the "Input page" field of the inputs of the Extractor_1 component.

  5. Now switch to the MSIE browser. The page shows the list of flight arrival details, along with their current status, in table format. We are going to create a field for each column and extract the data from the web page.

  6. ITPilot learns what to extract from a page, and how, from examples given by the user. To provide them, click the toolbar button labeled Assign Examples to enter extractor mode.

    TIP

    Remember to stop the recording to see the Assign Examples button.
  7. Pick a flight arrival that has all the fields we want to extract, so that we set a baseline with all the information present.

  8. Highlight the text below the "Origin" heading for the flight arrival that you selected, then right-click the selection. This will bring up a contextual menu for assigning examples; in our case, select New example and, in the submenu that appears, select New field.

  9. The dialog for new field creation will be displayed. Type the name of the field ("origin"), leave the rest of the options at their default values, and click "Ok".

  10. Do the same for the other five fields. Highlight the airline text, then right-click, select "Example 1 > New field" and type "airline"; do the same for the "flight", "arrival", "terminal" and "status" fields.

    TIP

    Make sure that you assign all these fields to "Example 1". We only use New example for the first field; the rest of the fields should be assigned to the same record as the first.
  11. Now we have assigned an example of the data we want to extract. If you right-click and select the "Example 1" field you will see the values for all assigned fields.

  12. As we saw in our first inspection of the page, some flight arrivals lack some parts of the data. For example, some flights have no value for the "terminal" field. We need to give ITPilot an example of this kind of record. If other flight arrivals lacked different fields, we would need to provide examples of those as well, so that ITPilot can generate a wrapper that extracts all the different combinations of optional fields.

  13. Select a flight arrival whose "terminal" field is empty and repeat steps 8, 9 and 10 for it. Note that you need to select New example for the first field and then Example 2 for the rest of them.

    Do not assign the values to the first example! ("Example 1").

    Example 2 will not have any value assigned to the "terminal" field. Do not use the Example 2 > new field option for this example; you are assigning the same fields that were created for the previous example but for a different record, so select Example 2 > origin, Example 2 > flight, etc.

  14. Now that we have all the examples we need, go back to the WGT. Double-click the Extractor component to bring up its configuration wizard. In the top left corner of the dialog there is a button labeled Import from browser. Click it to transfer the examples from the browser into the Extractor's wizard. When you do so, the "Generation" pane shows a wait animation while it generates a pattern to extract all the data from the page with the specified structure.

    NOTE

    If at this point you get a dialog notifying you about an autogenerated lexer, click "Yes" to continue.

    NOTE

    The extraction process in ITPilot is expressed in DEXTL (Denodo EXTraction Language). This language specifies how to give structure to the semi-structured information found in a web page by applying patterns (similar to regular expressions). These DEXTL specifications are usually generated automatically from a set of examples, but in many cases it is useful to write or modify them by hand.

    Check the ITPilot documentation for additional information about DEXTL. If you want to go further, our ITPilot training courses offer classes that teach all the details of DEXTL.
  15. After a while, the Extractor wizard should have generated the specification and be ready for testing. Click the Specification test button and then Refresh on the new pane that appears. Doing so tests the extraction process on the current page, and the table below the button should display all the flight arrival information from the page.

  16. Click "Ok" to save the Extractor component's configuration.

  17. Now we need to return the output of the Extractor component as the result of the wrapper. The easiest way of doing this is to add an Output component to the wrapper. Drag it onto the workspace now.

  18. Link the Output component after the Extractor component, and link the End component to the Output component.

  19. Select the Extractor component output ("Extractor_1_output") in the "Input records" input field of the Output component. Note that you may have to click the "+" icon to the right of the "Input records" field to enable the selector.

    NOTE

    The Output component has two mutually exclusive input fields. "Record" represents a single record to send to the output. "Input records" represents a list of records to send to the output, one by one.

    In this case, the Extractor component returns a list of records, so to send them to the output we use the "Input records" input of the Output component. If you try to use the "Record" input, the "Extractor_1_output" value will not appear.
  20. Save the wrapper and test it. You will see that a new browser window appears, navigates to the flight arrival information page and after a while all the data present in the page should appear in the results tab of the test wrapper dialog.
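
As a rough analogy for what the generated specification does, the sketch below uses a plain Python regular expression to turn flattened table rows into structured records. It is not DEXTL and not something ITPilot produces: the pipe-separated row format, the sample values, and the names SAMPLE_ROWS, ROW_PATTERN and extract_arrivals are all assumptions made for illustration. It only shows the idea behind the two examples assigned above, one pattern with five required fields and an optional "terminal".

```python
import re
from typing import Optional

# Hypothetical flattened rows; the real page markup and values will differ.
SAMPLE_ROWS = [
    "City A | Airline X | XX100 | 10:00 | 4 | Landed",
    "City B | Airline Y | YY200 | 10:15 |  | Scheduled",  # no terminal
]

# One pattern covers both kinds of record: every field is required except
# "terminal", which may be empty (like "Example 1" and "Example 2" above).
ROW_PATTERN = re.compile(
    r"\s*(?P<origin>[^|]+?)\s*\|"
    r"\s*(?P<airline>[^|]+?)\s*\|"
    r"\s*(?P<flight>[^|]+?)\s*\|"
    r"\s*(?P<arrival>[^|]+?)\s*\|"
    r"\s*(?P<terminal>[^|]*?)\s*\|"
    r"\s*(?P<status>[^|]+?)\s*"
)


def extract_arrivals(rows: list[str]) -> list[dict[str, Optional[str]]]:
    """Return one record per row that matches the pattern."""
    records = []
    for row in rows:
        match = ROW_PATTERN.fullmatch(row)
        if match is None:
            continue  # skip rows that do not have the expected structure
        record = match.groupdict()
        # Represent a missing terminal as None instead of an empty string.
        record["terminal"] = record["terminal"] or None
        records.append(record)
    return records


if __name__ == "__main__":
    # Emit the list of records one by one, as the Output component does
    # with its "Input records" field.
    for record in extract_arrivals(SAMPLE_ROWS):
        print(record)
```

If the pattern had been written from the first example alone, with "terminal" required, the second row would be silently dropped; that is why the tutorial asks for an example of each combination of optional fields.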

In this section we have created a web wrapper that navigates to a page and retrieves information from it in real time, which lets us use web data in our web integration solutions. In the next sections we will review more advanced uses of ITPilot for more complex scenarios.