Pagination of results

Let's now build a wrapper that performs another very common action: web searches. We will use the Yahoo search page (http://search.yahoo.com/) for this purpose. This new wrapper will receive a search term as a string and return the search results found by Yahoo search.

The main issue that this wrapper illustrates is pagination of results. On many web sites the data we are interested in is spread across several pages. In theory we already know enough about sequences to manually combine components that navigate through the different pages of results, but ITPilot provides a built-in component that makes dealing with pagination much easier: the NextIntervalIterator.

Let's see how we can configure it:

  1. Create a new wrapper named "yahoosearch".

  2. In the same way we did in the previous wrapper, declare a wrapper input parameter named searchterm of type string.

  3. Record a navigation sequence to http://search.yahoo.com/ that uses the searchterm variable for filling out the search box and clicks on the search button.

  4. Add a Sequence component to the wrapper, link it up to the Start component, set up the input variable from the Start component and import the navigation sequence from the browser.

  5. At this point our wrapper should be capable of receiving a search term and executing that search in Yahoo.

  6. The next bit is to tell ITPilot that our search results are spread across different pages. To do this, we add a new NextIntervalIterator component and link it after the Sequence component.

  7. The NextIntervalIterator is a loop-type component with two nodes. Remember that the components we link between the two nodes are executed on each iteration; in this case, each iteration of the NextIntervalIterator corresponds to one page of results.

  8. Configure the NextIntervalIterator component to receive the Sequence_1_output value as the "Input page" input field. That way the navigation through the different pages will be performed in the same browser window as the original navigation.

  9. Now it is time to manage the actual navigation sequence that we will use for going from one page to the next. Double-click the NextIntervalIterator component to open its wizard.

  10. The wizard shows one configuration field named "Pagination type". This pagination type refers to the style of pagination displayed in the page:

    • Constant: used when the page has a link that never changes and that takes us to the next page. Usually displayed as a "Next" link or a right-pointing arrow.

    • Single block: some other times the pages display a numbered index of pages, each number being a link that takes us to that page. For example, a sequence of links 1,2,3,4,5,6...

      In this case having a constant sequence in the wrapper will not work (we don't want it to always click on the link '2', for example) but the single block pagination style in the NextIntervalIterator will generate a varying sequence for us automatically.

    • Multiple block: this is the most complex case. The page displays a list of page numbers and a Next link, and clicking the Next link takes us to the next block of page numbers. For example, we can see

      1 2 3 4 5 Next

      and after clicking the "Next" link we will see:

      6 7 8 9 10 Next

  11. The Yahoo search results page displays two of those patterns: constant navigation (because it has a "Next" link that takes us to the next page of results) and single block pagination (because it has a list of page number links).

    Let's set the wrapper to use constant pagination:

    TIP: Always try to use constant pagination if you can, as it is simpler and more robust than the single or multiple block schemas.

  12. In the browser, we will record a navigation sequence to take us from page 1 to page 2. Click the Rec button, do not specify an initial URL, and click "Ok".

    TIP: When we record a navigation sequence without an initial URL, the resulting sequence will be relative to the point where the browser currently is.

  13. Right-click on the "Next" link at the bottom of the page and select "Click".

  14. After the browser reaches the second page of results, stop the sequence and import it into the NextIntervalIterator component by clicking on the Import from browser button.

  15. Last, we have to tell the NextIntervalIterator component how many pages we want to traverse. For websites with a finite number of results we normally want to go through all the pages of results, so we select the "Iterate until navigation fails" checkbox (because the last page of results does not usually have a "Next" link).

    Search engines are different, as they normally provide a seemingly infinite number of search results. In these cases it is advisable to uncheck the "Iterate until navigation fails" checkbox and manually specify a maximum number of pages to navigate. Set it to 5 for this example.

  16. Click Ok to close the NextIntervalIterator component's wizard and save the wrapper.

  17. At this point we can test the wrapper. After clicking Test wrapper and providing a search term in the dialog that follows, we will see that the browser goes to Yahoo search, types the search term, clicks "Search" and then navigates through the first five pages of results.

  18. The wrapper does not do anything on each page, but it is trivial to add the components needed for extracting the search results. As in the previous wrappers, define in the browser a set of examples (each one could have two fields, url and synopsis), add an Extractor component, set its input page, import the examples from the browser, and add the output component needed to return the extracted values.

    TIP: The Extractor component must receive as input page the output of the NextIntervalIterator component, not the output of the Sequence component. The former represents the current page of results, and the latter the initial page of results.
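The loop behaviour configured in steps 15 to 18 can be pictured with a small Python sketch. This is only an illustration of the idea, not ITPilot code or its internal implementation: the names `iterate_pages`, `click_next`, `make_site` and `max_pages` are hypothetical, and pages are represented by plain integers instead of browser pages.

```python
from typing import Callable, Iterator, Optional


def iterate_pages(
    first_page: int,
    click_next: Callable[[int], Optional[int]],
    max_pages: Optional[int] = None,
) -> Iterator[int]:
    """Yield the current page on each iteration of the pagination loop.

    click_next stands in for the recorded relative "Next" sequence: it
    returns the following page, or None when the navigation fails.
    max_pages=None models "Iterate until navigation fails".
    """
    page: Optional[int] = first_page
    visited = 0
    while page is not None:
        # Whatever sits inside the loop (e.g. an extractor) sees the
        # current page, not the initial one.
        yield page
        visited += 1
        if max_pages is not None and visited >= max_pages:
            break
        page = click_next(page)


def make_site(total_pages: int) -> Callable[[int], Optional[int]]:
    """Hypothetical site whose last page has no "Next" link."""
    def click_next(page: int) -> Optional[int]:
        return page + 1 if page < total_pages else None
    return click_next


# Finite result set: stop when the "Next" navigation fails.
finite = list(iterate_pages(1, make_site(3)))
# Search engine with far more pages than we want: cap at 5.
capped = list(iterate_pages(1, make_site(1000), max_pages=5))
print(finite)  # [1, 2, 3]
print(capped)  # [1, 2, 3, 4, 5]
```

The generator yields the current page before attempting the next navigation, which mirrors why the Extractor must take the iterator's output rather than the initial Sequence output.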

    With these steps we have created a wrapper that navigates through pages of results using the constant schema. Let's see how we would use the NextIntervalIterator for the case of a single block schema.

  19. Open the NextIntervalIterator component and select pagination type = single block.

  20. Go to the MSIE browser and place it on the first page of results.

  21. Record a relative navigation sequence that clicks on the "2" link.

  22. In the NextIntervalIterator wizard, click on Import from browser. The wizard will show us a message notifying us that the imported sequence matches a potential template for a single block schema, starting on the link numbered 2 and increasing by 1 on each page. Click "Yes" and ITPilot will generate a variable navigation sequence that will click on each page link individually.

  23. Click "Ok" to close the NextIntervalIterator component and save the wrapper.

After steps 19-23 the wrapper should work exactly the same as after step 18, but instead of always clicking on "Next" to advance to the next page, it will click on each individual page link.
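To compare the three pagination types described in step 10, the following Python sketch lists which link would be clicked to leave each page. It is a simplified model under stated assumptions, not ITPilot's API: the function names are hypothetical, links are identified only by their visible label, and for the multiple block case it is assumed that clicking "Next" lands on the first page of the next block.

```python
from typing import List


def constant_clicks(pages: int) -> List[str]:
    """Constant: the same "Next" link is clicked to leave every page."""
    return ["Next"] * (pages - 1)


def single_block_clicks(pages: int, start: int = 2, step: int = 1) -> List[str]:
    """Single block: click a numbered link, starting at `start` and
    increasing by `step` on each page."""
    return [str(start + i * step) for i in range(pages - 1)]


def multiple_block_clicks(pages: int, block_size: int = 5) -> List[str]:
    """Multiple block: click numbered links inside a block and "Next"
    to move to the following block (assumed to open on its first page)."""
    clicks = []
    for target in range(2, pages + 1):
        # Pages 6, 11, ... (for block_size=5) are reached through "Next".
        starts_new_block = (target - 1) % block_size == 0
        clicks.append("Next" if starts_new_block else str(target))
    return clicks


print(constant_clicks(4))        # ['Next', 'Next', 'Next']
print(single_block_clicks(4))    # ['2', '3', '4']
print(multiple_block_clicks(7))  # ['2', '3', '4', '5', 'Next', '7']
```

The constant case needs no per-page state at all, which is one way to see why it is the simplest and most robust option when the site offers it.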