Downloading files and images

Another common task is downloading files from a web page. There are two possible scenarios: downloading files whose URL we know (most commonly images or documents linked directly from a page) and downloading files that are served dynamically (most commonly files that are generated after clicking "submit" on a form).

Download files with known URLs

Let's start with the first case: downloading images. The web page http://static.cbslocal.com/cbs/kpix/kpixcamsite/sanfran.html# shows the current images from traffic webcams around San Francisco. We will create a wrapper that accesses this page and stores those images, so that at any point we can run the wrapper and show the user the traffic situation.

  1. First create a new wrapper named "sftrafficcams".

  2. Add a Sequence component and record a navigation sequence to the URL http://static.cbslocal.com/cbs/kpix/kpixcamsite/sanfran.html#.

  3. Add an Extractor component and link it after the Sequence component (remember to use the output of the Sequence component as the input page of the Extractor, as usual).

  4. In the browser (positioned at http://static.cbslocal.com/cbs/kpix/kpixcamsite/sanfran.html#) enter the assign examples mode.

  5. Add a new example with one field named 'imgurl'. We want to extract the value of the "src" attribute of each image, so remember to select the "Markup's attribute value" checkbox when creating the new field of the example. Set the type of this field as "URL", so we get back absolute URLs instead of relative URLs.



  6. Add a second example, for one of the small images.

  7. Import the examples into the Extractor component and test that we get back our URLs.

  8. What we have is a list of URLs. Now we need to iterate over all URLs to download each image individually. For this, we add an Iterator component after the Extractor component and set the Extractor_1_output as the only input of the Iterator component.

  9. Now we will add a new Sequence component to navigate to each image. Link it as the first component inside the Iterator component. Add the output of the Iterator component (Iterator_1_output) as an input value of the Sequence component.

    In the Sequence wizard configure the Sequence Type as "Denodo Browser" (do this always for downloading binary files) and set the navigation sequence to the following:

    Navigate(@imgurl,1);

    (assuming you named the field in your examples "imgurl").

  10. Click "Ok". Now we have a Sequence component that retrieves the binary data of each file.

  11. To store each image on disk we will use a Save File component. Drag a Save File component to the workspace and link it after the Sequence_2 component. Set its "Input page" input to "Sequence_2_output". The component will then save each image in the default folder (check the Denodo Platform documentation to learn how to configure the default download folder) with an autogenerated name, so there are no name collisions between different wrapper executions.

  12. Now let's return the URL and the local file path of each image as the result of the wrapper. The output of the Save File component is a record with one field, the local path. To add the URL of the saved image we will use a new Record Constructor component after the Save File component. Set both Iterator_1_output and Save_File_1_output as its inputs.

  13. Open the Record Constructor wizard and click on the "+" icon on the right side of every field, so that all of them are added to the output of the component. Rename the second field to "filepath" and click "Ok".

  14. Add an Output component that receives as input the Record_Constructor_1_output value, so we output the records as results of the wrapper.

When we execute this wrapper, it accesses the web site and returns four records, each with the URL of an original image and the local file path where it was downloaded. You can open those files with any image viewer to check that the binary data was downloaded correctly.
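For readers who find it easier to follow the data flow in code, the wrapper above can be approximated outside ITPilot with a short Python sketch. This is only an illustration under assumptions, not how ITPilot works internally: the function names (extract_image_urls, fetch_bytes, save_file, download_images) are hypothetical, and each one stands in for one component of the wrapper (Extractor, second Sequence, Save File, and Iterator plus Record Constructor).

```python
import tempfile
import uuid
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class ImgSrcParser(HTMLParser):
    """Collects the "src" attribute value of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, page_url):
    """Extractor step: return the image URLs, resolved to absolute URLs."""
    parser = ImgSrcParser()
    parser.feed(html)
    return [urljoin(page_url, src) for src in parser.sources]


def fetch_bytes(url):
    """Sequence step: retrieve the binary data behind one URL."""
    with urlopen(url) as response:
        return response.read()


def save_file(data, folder, suffix=""):
    """Save File step: write the bytes under an autogenerated name."""
    path = Path(folder) / f"{uuid.uuid4().hex}{suffix}"
    path.write_bytes(data)
    return str(path)


def download_images(html, page_url, folder=None, fetch=fetch_bytes):
    """Iterator + Record Constructor: build one record per image."""
    if folder is None:
        # Assumption: a fresh temporary folder plays the role of the
        # default download folder.
        folder = tempfile.mkdtemp()
    records = []
    for imgurl in extract_image_urls(html, page_url):
        suffix = Path(urlparse(imgurl).path).suffix
        filepath = save_file(fetch(imgurl), folder, suffix)
        records.append({"imgurl": imgurl, "filepath": filepath})
    return records
```

To try it, pass the HTML of the page and its URL to download_images; the result is a list of records with the same two fields as the wrapper output, "imgurl" and "filepath". The fetch parameter exists only so the download step can be replaced, for example to test the flow without network access.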

Download files with unknown URLs

The other interesting case is a file that is generated for us after submitting a form. Here we don't have a URL for the file beforehand, so we cannot use an Extractor component to retrieve it and must use a different mechanism to download the file. Let's see how it works by retrieving US census information from this website: http://census.ire.org/data/bulkdata.html.

  1. Create a new wrapper named "censusdata".

  2. Add a Sequence component and link it from the initial component.

  3. Open a new MSIE browser window (remember to do it from the Browser > New browser menu option!) and record a navigation sequence to http://census.ire.org/data/bulkdata.html .

  4. Before stopping the sequence, right-click on the "Download CSV" file and select Click and... > Save.

  5. You will get the usual MSIE download dialog; select a folder and a file name and click "Save".

    Note that, as specified on the census page, we are going to download the data gzip-compressed.

  6. After the file is saved to disk, stop the recording and import it into your Sequence component back in the WGT.

  7. Save your wrapper and test it to download the file again.

In this case we have used ITPilot to download a file whose URL is not known in advance. This is also a common scenario, and ITPilot offers several ways to solve it. If you want to learn about the other options for saving files in these scenarios, we recommend our in-depth training course on web automation, which explains the different ways of solving this problem.
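Because the census file is delivered gzip-compressed, whatever consumes the saved file has to unpack it before reading the CSV data. The following Python sketch shows one way to handle this outside ITPilot. It is an assumption-laden illustration: save_download and csv_header are hypothetical helper names, a temporary folder stands in for the download folder, and the file is assumed to be UTF-8 encoded CSV.

```python
import csv
import gzip
import tempfile
from pathlib import Path

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def save_download(payload, filename, folder=None):
    """Store downloaded bytes, unpacking them first if they are gzip data."""
    if folder is None:
        # Assumption: a fresh temporary folder stands in for the
        # folder chosen in the download dialog.
        folder = tempfile.mkdtemp()
    if payload[:2] == GZIP_MAGIC:
        payload = gzip.decompress(payload)
    path = Path(folder) / filename
    path.write_bytes(payload)
    return str(path)


def csv_header(path):
    """Return the column names from the first row of a saved CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])
```

Checking the first two bytes lets the same function accept both compressed and uncompressed downloads, so the caller does not need to know in advance how the server delivered the file.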

Our next wrapper will show how to extract information that is presented in two levels.