Accessing detail pages

Our last wrapper example shows how to extract information that is spread across two levels. The most common case is a web page that displays a list of results, where each item contains a link to a second page with additional details for that item. To showcase this, we are going to build a wrapper that navigates to a household items database found at http://householdproducts.nlm.nih.gov/advancedsearch.htm, searches for a product or a family of products, and extracts all the information about them. This can be useful, for example, for a self-service customer service app that retrieves the manufacturer's phone number in case of a problem.

  1. Create a new wrapper named "householditems".

  2. Declare one mandatory input parameter of type string named productName in the configuration of the init component.

  3. Add a Sequence component that navigates to http://householdproducts.nlm.nih.gov/advancedsearch.htm and performs a search in the main product search box (for example, search for "glue"). Remember to configure the Sequence component to accept the Start_1_output value as an input parameter, and to use the same variable name for the search term as the parameter name you specified in the init component configuration (productName).

  4. Add a new Extractor component and set it to consume the Sequence_1_output value as input page.

  5. Enter the Assign Examples mode in the browser and assign several examples of products, retrieving the name for each one.

  6. Now we have to tell ITPilot how to access the detail page for each product. While in assign examples mode, click on Rec to record the detail page navigation. We will record it for a single item and ITPilot will automatically analyze it to apply it to all elements.

    When clicking Rec we will get a new dialog. Specify a descriptive name for the sequence (for example, "detailsequence") and then select the first element that we assigned in the previous step. This is the item we are going to record the sequence for.

    Click Ok when done.

  7. We are now in sequence recording mode, so we just click on the product that we are recording the sequence for. When clicking Stop, ITPilot will tell us that it is going to generalize the sequence for all items.

  8. The next step is to import the sequence into the Extractor, back in the WGT. Double-click on the Extractor component and click Import data from browser.

    Click "Ok".

  9. The last thing we need to do for the extractor is to limit the search of items to the actual results table. In the browser, highlight the text "Search results" and in the Extractor wizard click on From.

  10. Now it's time to add an Iterator component to iterate through the list of results extracted from the main page. Link it after the Extractor and set the Extractor_1_output value as the input of the Iterator component.

  11. For each element we are going to navigate to the detail page of that element. To do that, we use the Extractor Sequence component. Drag a new ExtractorSequence component and link it as the first component of the iteration.

    Set the following inputs of the component: Iterator_1_output as the "Extractor record" and "Extractor_1" as the "Related extractor".

  12. At this point the wrapper extracts all the products and navigates to the detail page of each product. Now it's time to extract the detail data for each product. We are going to retrieve all key/value pairs from the product data tables, so that all the information is available for further processing.

    Add a second Extractor component to the wrapper, linked after the Extractor_Sequence_1 component. Set the Extractor_Sequence_1_output value as the "Input page" input of the new Extractor.

  13. Back in the browser, let's assign the examples for the new page. First, make sure that you're in the Assign Examples mode, and then delete the data from the previous extraction process (so we start with a blank slate). Right-click on the page and select Delete All.

    Click on "Yes".

  14. Assign several examples with two fields: the first one named "dataField", which is the value of the first column of the table, and the second one named "value", which contains the value of the second column of the table. For example, have these values for the first example:

    When selecting the first column's value, note that all items end in a colon. We don't want to extract that, so make sure that you select all the data without the colon. For example "Product Name" instead of "Product Name:".

  15. After adding two or three examples of fields to extract, import the data back in the WGT. Double-click on the Extractor_2 component to open its wizard and then click on Import data from browser. If we test the specification we should get something like this:

    Click "Ok" to close the Extractor_2 wizard.

  16. The next step is to put together a record with the item data we extracted with each of the Extractor components. The component to use for this task is the Record Constructor. Drag one onto the wrapper and link it after the Extractor_2 component.

    The record constructor needs the data from the current product being analyzed and the detail data for said product. Those are represented by Iterator_1_output and Extractor_2_output respectively, so add two input slots to the Record_Constructor_1 component and configure them with those values.

  17. Now we need to open the Record Constructor wizard (double-click on the component to do so), choose which fields of the inputs will appear in the output (all of them in our case), and create any derived fields we may want (none, in this case).

    Click on all the "+" icons on the right of the two fields, rename the second to "data" and click "Ok" to close the wizard.

  18. Finally, the only thing remaining is to output the result of each item. Add an Output component as the last component of each iteration, and configure its input value to be the output of the Record Constructor component.

And that's it! When testing the wrapper, it will navigate to the main search page, search for the specified term, then extract all the products found. For each product, it will navigate to its detail page, then retrieve all the relevant information, and return a combined row with all the data for that product.
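To make the data flow concrete, here is a minimal Python sketch of what the finished wrapper does. It is a conceptual model only, not ITPilot code or an ITPilot API: the function names, the in-memory RESULTS_PAGE and DETAIL_PAGES data, and the sample products are all hypothetical stand-ins for the components built in the steps above.

```python
# Conceptual model only: none of these names are ITPilot APIs.
# Hypothetical in-memory "site": a results page plus one detail page per link.
RESULTS_PAGE = [
    {"name": "Example Glue A", "link": "/detail/a"},
    {"name": "Example Glue B", "link": "/detail/b"},
]
DETAIL_PAGES = {
    "/detail/a": [("Product Name:", "Example Glue A"), ("Form:", "liquid")],
    "/detail/b": [("Product Name:", "Example Glue B"), ("Form:", "paste")],
}


def extract_products(results_page):
    """Extractor_1: one record per result, keeping the link to its detail page."""
    return [{"name": row["name"], "link": row["link"]} for row in results_page]


def open_detail_page(product):
    """Extractor Sequence: replay the recorded click for a single product."""
    return DETAIL_PAGES[product["link"]]


def extract_details(detail_page):
    """Extractor_2: key/value pairs, dropping the trailing colon from each label."""
    return [
        {"dataField": label.strip().rstrip(":"), "value": value}
        for label, value in detail_page
    ]


def build_record(product, details):
    """Record Constructor: combine the product name with its detail rows as 'data'."""
    return {"name": product["name"], "data": details}


def run_wrapper(product_name):
    """Iterator + Output: yield one combined record per product found."""
    # The search is faked here by filtering the hypothetical results by name.
    matches = [
        row for row in RESULTS_PAGE
        if product_name.lower() in row["name"].lower()
    ]
    for product in extract_products(matches):
        details = extract_details(open_detail_page(product))
        yield build_record(product, details)


if __name__ == "__main__":
    for record in run_wrapper("glue"):
        print(record)
```

Each yielded record corresponds to one row returned by the wrapper: the product name from the first Extractor, plus a "data" field holding the dataField/value pairs from the detail page.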

After these steps, we have learned how to create a wrapper that extracts information about items when that information is spread over different pages. This concludes our basic introduction to the creation of wrappers, but it is just the beginning. To learn more, including advanced data extraction patterns such as multi-level extraction, optimization techniques that make wrappers execute much faster, and debugging procedures, review our training offering on Web Automation courses.