Web Automation

Extracting information from web sources and giving it structure

Why web integration?

Today's IT landscape contains many locations that can act as sources of valuable data. Traditional enterprise sources include relational databases, data warehouses and web services, among others. Data consolidation helps reduce the number of data silos, whereas Data Virtualization can bridge the gaps where consolidation is not possible or desirable.

In some cases, data is siloed in external sources that we have no control over. In these scenarios our only choice is to consume the data in the format in which it is presented to us. Normally this external information is provided through APIs, but sometimes the only way to reach it is through a web page in a browser, because the data is meant to be consumed by humans, not machines. A large amount of information is published this way, and it can provide considerable value to most companies, either by itself or when used to enrich the data within the datacenter.

With traditional web scraping techniques, the cost of actually getting this data is usually higher than the value obtained from it, because those techniques are difficult to apply and maintain. As a result, most companies never manage to implement a successful solution. Denodo ITPilot changes the balance by lowering the cost of entry, so that it becomes not only feasible but easy to integrate internal company data with externally maintained information.
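
To make the maintenance cost of hand-written scraping concrete, the following is a minimal sketch in plain Python (standard library only, not ITPilot). The HTML fragment, the class names `name` and `price`, and the helper `extract_rows` are all hypothetical, chosen only for illustration; a real project would also have to handle fetching, navigation, sessions and JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real site would be fetched over HTTP.
SAMPLE_HTML = """
<table id="prices">
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
</table>
"""


class PriceTableParser(HTMLParser):
    """Collects the text of each <td> into one dict per <tr>, keyed by CSS class."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None  # row currently being built, or None outside <tr>
        self._field = None    # class attribute of the <td> we are inside

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = {}
        elif tag == "td" and self._current is not None:
            self._field = dict(attrs).get("class")

    def handle_data(self, data):
        text = data.strip()
        if text and self._field and self._current is not None:
            self._current[self._field] = text

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr":
            if self._current:
                self.rows.append(self._current)
            self._current = None


def extract_rows(html):
    """Return structured records; rows missing an expected field are dropped."""
    parser = PriceTableParser()
    parser.feed(html)
    parser.close()
    return [
        {"name": row["name"], "price": float(row["price"])}
        for row in parser.rows
        if "name" in row and "price" in row
    ]


if __name__ == "__main__":
    for record in extract_rows(SAMPLE_HTML):
        print(record)
```

The extraction logic is tied to the page's markup: if the site renames the `price` class to, say, `cost`, `extract_rows` returns an empty list without raising any error. This kind of silent breakage is one reason hand-maintained scrapers tend to be expensive to keep working.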

This tutorial will introduce Denodo ITPilot and show how to use it for performing several common tasks found in real-world Web Integration projects.

Always take into account and respect the terms and conditions of any web site involved in your web integration projects.

NOTE

These tutorials use external websites for educational purposes, and the wrappers worked as intended at the time of their creation and last revision. Because the sites that the wrappers access are not under Denodo's control, we cannot guarantee that the examples will work at all times: third parties may change these pages.

What is ITPilot?

Denodo ITPilot (ITP for short) is the Web Integration component of the Denodo Platform. Using ITPilot we can create a web wrapper that accesses a specific website, usually to extract information, although a wrapper can do anything a user would do in a web browser. We can then import that wrapper into the Denodo Platform as a data source and combine the data from the web with other views that we have created within Denodo.

These web wrappers are created graphically using the integrated development environment that ITPilot provides. When deployed in the Denodo Platform they will retrieve and return information from the specific web site in real time.

ITPilot architecture

ITPilot usage is split into two scenarios: development and execution. We first use the development tools to tell Denodo which website we want to access, which steps need to be performed, how the information is formatted on the page and how to extract it, which transformations we want to apply, and so on.

ITPilot makes two tools available for these wrapper creation tasks:

  • The ITPilot Wrapper Generation tool (WGT) is a desktop client that will be your main development tool for creating wrappers graphically.
  • The MS Internet Explorer toolbar: this is an IE add-on that enables the user to record actions in the browser and have them automatically imported into the wrapper being developed.

Once the development and testing of a wrapper are finished, the wrapper is ready to be deployed in the Denodo server. Alongside it we also need to start the Browser Pool server, which manages the browser instances used to retrieve the data in real time during normal operation of the web integration solution. ITPilot also has an Administration tool that we will not use in these tutorials; please check the ITPilot documentation for more details.

In this tutorial you will learn: