Retrieving data from a single page

The previous wrapper introduced automated browsing through pages. That is the first half of what a web automation wrapper usually does. The other major task of a typical wrapper is extracting information from a page and giving it structure. In this example we will build a wrapper that browses to the 511.org website at http://traffic.511.org/parking/index#SEARCH?CIT=San+Francisco&SRT=Availability and pulls real-time information about parking facilities in the San Francisco Bay Area.

  1. Create a new wrapper named "sfparkinginfo".

  2. Add a Sequence component that navigates to http://traffic.511.org/parking/index#SEARCH?CIT=San+Francisco&SRT=Availability.


  3. The component used to retrieve information from a web page is called Extractor. Drag an Extractor component onto the workspace and link it after the Sequence component.

  4. The Extractor component will extract data from the page that we navigated to, so set the output of the Sequence component (Sequence_1_output) as the value of the "Input page" input of the Extractor_1 component.

  5. Let's focus now on the MSIE browser. On the left side of the page there is a list of parking facilities with their current state, in this format:

    Garage
    Ellis OFarrell Garage
    123 O'Farrell Street
    Space Available: 84% (577/691)
    Auto Pricing :
    $1.00 - 6:00 PM - 12:00 AM Per hour
    $2.00 - 12:00 AM - 9:00 AM Per hour
    $3.00 - 9:00 AM - 12:00 PM Per hour
    $3.50 - 12:00 PM - 3:00 PM Per hour
    $3.50 - 3:00 PM - 6:00 PM Per hour
    $16.00 - Sunday Daily Max / Lost Ticket
    $19.00 - Early Bird: Flat rate. Mon-Fri: Enter before 8:30am and exit before close
    $34.00 - Daily Maximum / Lost Ticket
    Motorcycle Pricing :
    $7.00 - Motorcycle: Flat rate

    We are going to break this information into separate fields, so that we have a more structured view of the data on the page. Specifically, we will create these fields:

    garage            Ellis OFarrell Garage
                      123 O'Farrell Street
    space_available   83% (569/691)
    auto_price        $1.00 - 6:00 PM - 12:00 AM Per hour
                      $2.00 - 12:00 AM - 9:00 AM Per hour
                      $3.00 - 9:00 AM - 12:00 PM Per hour
                      $3.50 - 12:00 PM - 3:00 PM Per hour
                      $3.50 - 3:00 PM - 6:00 PM Per hour
                      $16.00 - Sunday Daily Max / Lost Ticket
                      $19.00 - Early Bird: Flat rate. Mon-Fri: Enter before 8:30am and exit before close
                      $34.00 - Daily Maximum / Lost Ticket
    moto_price        $7.00 - Motorcycle: Flat rate

    We can do this because there are new lines and other characters separating all these bits of information, so ITPilot will use those separators to split the data into the specified fields.

  6. ITPilot learns what to extract from a page, and how, from examples given by the user. To provide them, click the toolbar button labeled Assign Examples to enter extractor mode.

    TIP: Remember to stop the recording to see the Assign Examples button.

  7. If you review the parking list you can see that some entries lack some of the fields. Pick one entry that has all the fields we want to extract, so that we set a baseline with all the information present.

  8. In the entry that you selected, highlight the text below the word "Garage", then right-click the selection. This opens a context menu for assigning examples; in our case, select New example and, in the submenu that appears, select New field.

  9. The dialog for new field creation will be displayed. Type the name of the field ("garage"), leave the rest of the options at their default values, and click "Ok".

  10. Do the same for the other three fields. Highlight the text for "Space Available", then right-click, select "Example 1 > New field" and type "space_available"; do the same for the "Auto Pricing" and "Motorcycle Pricing" fields.

    TIP: Make sure that you assign all these fields to "Example 1". New example is used only for the first field; the rest of the fields must be assigned to the same record as the first.

  11. Now we have assigned an example of the data we want to extract. If you right-click and select the "Example 1" entry you will see the values of all assigned fields.

  12. As we saw in our first inspection of the page, some entries lack parts of the data. For example, some of them do not display the "Motorcycle Pricing" field. We need to give ITPilot an example of this type of entry. If other entries lacked different fields, we would need to provide an example of each of those types as well, so that ITPilot can generate a wrapper that handles every combination of optional fields.

  13. Select an entry that lacks the "Motorcycle Pricing" field but has all the others, and repeat steps 8, 9 and 10 for it. Notice that you need to select New example for the first field and then Example 2 for the rest of them.

    Do not assign these values to the first example ("Example 1").

    Example 2 will not have any value assigned to the "Motorcycle Pricing" field. Do not use the Example 2 > new field option for this example; you are assigning the same fields that were created for the previous example but for a different record, so select Example 2 > garage, Example 2 > space_available, etc.

  14. Now that we have all the examples we need, go back to the WGT. Double-click the Extractor component to bring up its configuration wizard. At the top left corner of the dialog there is a button labeled Import from browser. Click it to transfer the examples from the browser into the Extractor's wizard. The "Generation" pane then shows a wait animation while it generates a pattern to extract all the data from the page with the specified structure.

    NOTE: If at this point you get a dialog notifying you about an autogenerated lexer, click "Yes" to continue.

    NOTE: The extraction process in ITPilot is expressed in DEXTL (Denodo EXTraction Language). This language specifies how to give structure to the semi-structured information found in a web page by applying patterns (similar to regular expressions). These DEXTL specifications are usually generated automatically from a set of examples, but in many cases it is useful to write or modify them by hand.

    See the ITPilot documentation for additional information about DEXTL; the ITPilot training courses also offer classes that teach all the details of DEXTL.

  15. After a while, the Extractor wizard should have generated the specification and be ready for testing. Click the Specification test button and then Refresh in the new pane that appears. This tests the extraction process on the current page and should display all the parking facilities from the page in the table below the button.

  16. Click "Ok" to save the extractor component's configuration.

  17. Now we need to return the output of the Extractor component as the result of the wrapper. The easiest way of doing this is to add an Output component to the wrapper. Drag it onto the workspace now.

  18. Link the Output component after the Extractor component, and link the End component to the Output component.

  19. Select the Extractor component output ("Extractor_1_output") in the "Input records" input field of the Output component. Note that you may have to click the "+" icon to the right of the "Input records" field to enable the selector.

    NOTE: The Output component has two mutually exclusive input fields. "Record" represents a single record to send to the output. "Input records" represents a list of records to send to the output, one by one.

    In this case, the Extractor component returns a list of records, so to send them to the output we use the "Input records" input of the Output component. If you try to use the "Record" input, the "Extractor_1_output" value will not appear.

  20. Save the wrapper and test it. A new browser window appears and navigates to the parking information page; after a while, all the data present in the page should appear in the results tab of the test wrapper dialog.
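To picture what the field splitting in step 5 and the optional field in steps 12 and 13 amount to, here is a minimal Python sketch. It is only an analogy: it is not DEXTL and not how ITPilot works internally, and the names `LABELS`, `SAMPLE`, `NO_MOTO` and `split_record` are hypothetical. It assumes each field starts at a label line ("Garage", "Space Available", "Auto Pricing", "Motorcycle Pricing") exactly as in the sample text shown in step 5.

```python
# Map the labels seen on the page to the field names created in the tutorial.
LABELS = {
    "Garage": "garage",
    "Space Available": "space_available",
    "Auto Pricing": "auto_price",
    "Motorcycle Pricing": "moto_price",
}

# The sample entry from step 5, copied as plain text.
SAMPLE = """\
Garage
Ellis OFarrell Garage
123 O'Farrell Street
Space Available: 84% (577/691)
Auto Pricing :
$1.00 - 6:00 PM - 12:00 AM Per hour
$2.00 - 12:00 AM - 9:00 AM Per hour
$3.00 - 9:00 AM - 12:00 PM Per hour
$3.50 - 12:00 PM - 3:00 PM Per hour
$3.50 - 3:00 PM - 6:00 PM Per hour
$16.00 - Sunday Daily Max / Lost Ticket
$19.00 - Early Bird: Flat rate. Mon-Fri: Enter before 8:30am and exit before close
$34.00 - Daily Maximum / Lost Ticket
Motorcycle Pricing :
$7.00 - Motorcycle: Flat rate
"""

# The same entry cut off before its motorcycle section, standing in for
# an entry that lacks the optional "Motorcycle Pricing" field.
NO_MOTO = SAMPLE.split("Motorcycle Pricing")[0]


def split_record(text):
    """Split one entry into named fields, using label lines as separators."""
    record = {field: None for field in LABELS.values()}  # missing fields stay None
    current = None
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        # Text before the first ":" is a candidate label, e.g. "Auto Pricing".
        label, _, rest = line.partition(":")
        label = label.strip()
        if label in LABELS:
            current = LABELS[label]
            # Keep any value that shares the line with its label.
            record[current] = [rest.strip()] if rest.strip() else []
        elif current is not None:
            record[current].append(line)
    return {
        field: "\n".join(lines) if lines is not None else None
        for field, lines in record.items()
    }


full_record = split_record(SAMPLE)
partial_record = split_record(NO_MOTO)
```

Here `full_record` plays the role of "Example 1" and `partial_record` the role of "Example 2": both share the same four fields, and the second simply has no value for `moto_price`.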

In this section we have created a web wrapper that navigates to a page and retrieves information from it in real time, which lets us use web data in our web integration solutions. In the next sections we will review more advanced uses of ITPilot for dealing with more complex scenarios.