HTML attributes

So far we have created wrappers that extract text visible to the user. Sometimes, however, we want to retrieve information that is present in the page but not displayed: it is stored in attributes of the HTML tags in the page's source code. The most common cases are link URLs (the href attribute of the <a> tag) and image URLs (the src attribute of <img> tags).
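
To make the idea concrete outside the tool, here is a minimal sketch in Python that collects href and src values from a page's source using only the standard library. It is not how ITPilot works internally; the class name AttributeCollector, the function collect_attributes and the sample markup are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; tag names are lowercased.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def collect_attributes(html):
    """Return (links, images) found in the given HTML source."""
    parser = AttributeCollector()
    parser.feed(html)
    parser.close()
    return parser.links, parser.images


# Hypothetical sample markup: the URLs are not part of the visible text.
SAMPLE = (
    '<p>See <a href="/docs/index.html">the docs</a> '
    '<img src="logo.png" alt="Logo"></p>'
)

links, images = collect_attributes(SAMPLE)
print(links)
print(images)
```

Note that the visible text of the sample ("See the docs") contains neither URL; both values are available only through the tag attributes.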

Let's look at an example. In our Yahoo search wrapper we were able to extract the links to the search results because Yahoo displays those URLs as text below the actual link. In most cases only the link itself will be available, so let's see how to extract the URL then:

  1. When assigning examples of the data being extracted, highlight the link or just right-click over it.

  2. Select New example > new field or Example XYZ > new field, depending on whether you are creating a new example or adding a field to an existing one.

  3. In the dialog that appears, type the field name and select "Markup's attribute value". This marks the field as being extracted from an HTML attribute.

    TIP: Use the "string" data type if you want to extract the contents of the attribute exactly as they appear, or the "url" data type if you want ITPilot to convert relative URLs to absolute URLs automatically. The latter is useful if you later want to access the link from anywhere else.

  4. Click Ok.

  5. A new dialog appears. Its left column lists all the HTML tags found under the mouse pointer or in the highlighted area. Select the "A" tag.

  6. Once you select a tag in the left column, the center column shows all the attributes of that tag. Select the "href" attribute and click "Ok".
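
The difference between the "string" and "url" data types mentioned in step 3 can be illustrated with a short Python sketch. It uses urljoin from the standard library to show what resolving a relative URL against the address of the page means; it is only an analogy, not ITPilot's own conversion code, and the page address and href values below are made-up examples.

```python
from urllib.parse import urljoin

# Hypothetical address of the page the attribute values were extracted from.
PAGE_URL = "http://www.example.com/search/results.html"

# Attribute values as they might appear in the source ("string" behaviour).
raw_values = [
    "../images/logo.png",          # relative to the page's directory
    "/docs/page.html",             # relative to the site root
    "http://other.example.org/x",  # already absolute
]

# The same values resolved against the page address ("url" behaviour).
absolute_values = [urljoin(PAGE_URL, value) for value in raw_values]

for raw, absolute in zip(raw_values, absolute_values):
    print(f"{raw} -> {absolute}")
```

A relative value such as "../images/logo.png" only makes sense together with the page it came from, which is why the absolute form is the one to keep if the link will be followed later from a different context.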

The rest of the wrapper is created in the exact same way as in the original Yahoo search example.

Fields created this way can take the value of any HTML attribute present in the page. As noted above, the most common uses are retrieving the URLs of links and images, but any attribute can be accessed.
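
As a rough illustration of "any attribute", the following Python sketch generalizes the idea: given a tag name and an attribute name (the same two choices made in the tag and attribute columns of the dialog), it returns the matching values. The names TagAttributeIndex and attribute_values and the sample snippet are hypothetical, and the code only mimics the selection; it says nothing about how ITPilot implements it.

```python
from html.parser import HTMLParser


class TagAttributeIndex(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []  # list of (tag_name, {attribute_name: value})

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def attribute_values(html, tag, attribute):
    """Return the values of `attribute` on every `tag` element in `html`."""
    index = TagAttributeIndex()
    index.feed(html)
    index.close()
    return [
        attrs[attribute]
        for name, attrs in index.tags
        if name == tag and attrs.get(attribute) is not None
    ]


# Hypothetical snippet with attributes other than href and src.
SNIPPET = (
    '<a href="/item/42" title="Item 42" class="result">'
    '<img src="thumb.png" alt="Thumbnail"></a>'
)

titles = attribute_values(SNIPPET, "a", "title")
alts = attribute_values(SNIPPET, "img", "alt")
print(titles, alts)
```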