Environment Setup

Data Science and Machine Learning

Introduction

This document explains how to create the execution environment for the Data Science and Machine Learning Tutorial that you are about to start.

It uses Vagrant as the provisioning platform and VirtualBox as the hypervisor.

Once the environment is provisioned, you'll have a running virtual machine (VM) with the Denodo Platform installed, as well as the data sources needed to do the tutorial exercises.

The virtual machine can be managed independently of Vagrant, like any other VirtualBox VM. This means that you can stop and reboot it as you like, and your modifications are persisted in it.

Requirements

NOTE: To provision the environment you need a valid Denodo Platform 8.0 Standalone license. If you do not have one, the tutorial is illustrated step by step and was conceived so that it can be followed in full just by reading it.

Hardware

The VM should be assigned 12 GB of memory and about 25 GB of storage.

Software

You can use the latest versions of Vagrant and VirtualBox. The VM has been tested in the following environment:

  • Runtime provider: VirtualBox 6.1.18
  • Provisioning platform: Vagrant 2.2.4
  • Host system: Windows 10 Pro
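
As a quick sanity check, the installed versions can be compared with the tested ones listed above. The sketch below is only an illustration: the helper names version_ge and report_version are hypothetical, and it assumes a shell with GNU sort (for example Git Bash) and that vagrant and VBoxManage are on the PATH.

```shell
# Returns success when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), available in GNU coreutils and Git Bash.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints one line saying whether the installed version reaches the tested one.
# Arguments: tool name, installed version, tested version.
report_version() {
  if version_ge "$2" "$3"; then
    echo "$1 $2: OK (tested with $3)"
  else
    echo "$1 $2: older than tested version $3"
  fi
}

# Usage (both tools must be on the PATH):
#   report_version Vagrant "$(vagrant --version | awk '{print $2}')" 2.2.4
#   report_version VirtualBox "$(VBoxManage --version | sed 's/[^0-9.].*//')" 6.1.18
```

An older version is not necessarily a problem; the check only shows where your setup differs from the tested one.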

Installing via Vagrant provisioning

Process Overview

Once you start the provisioning, Vagrant downloads a canonical virtual machine image to your machine and uses it as the base for the provisioning. It then boots that virtual machine as a VirtualBox guest and runs the script vagrant/install-files/setup.sh, which performs all the installation and configuration operations needed to get the environment ready for use.
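
Because the provisioning takes a while, it can help to confirm beforehand that the files listed in step 5 of the Steps section are in place. The following is a minimal pre-flight sketch, not part of the tutorial release: check_artifacts is a hypothetical helper, and it assumes the file names are exactly those given in step 5 and that the script presumably reads them from vagrant/install-files/artifacts.

```shell
# Checks that every file listed in step 5 exists in the artifacts folder.
# Prints each missing file and returns non-zero if at least one is missing.
# Argument: artifacts folder (defaults to the path used in the tutorial release).
check_artifacts() {
  local dir="${1:-vagrant/install-files/artifacts}"
  local missing=0
  local name
  while IFS= read -r name; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $name"
      missing=1
    fi
  done <<'EOF'
Apache Zeppelin for Denodo - Standalone.zip
denodo.lic
denodo-install-8.0-ga-linux64.zip
denodo-v80-update-20210209.zip
denodo-systemd-services-release-20210408.tar
dstutorial-release-20210428.zip
mysql-connector-java-8.0.20.zip
EOF
  return "$missing"
}

# Usage, from the folder dstutorial-release-20210428:
#   check_artifacts && echo "all artifacts present"
```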

Steps

  1. Download and install Vagrant. Instructions here.
  2. Download and install VirtualBox. Instructions here.
  3. Download the dstutorial release archive denodo_tutorial_data_science.zip. Extract it. This archive contains two files.
    • dstutorial-release-20210428.zip
    • denodo-systemd-services-release-20210408.tar
    They are described in the table below.
  4. Extract dstutorial-release-20210428.zip and navigate to the folder dstutorial-release-20210428.
  5. Copy the following files into vagrant/install-files/artifacts:

    File name                                      Description
    Apache Zeppelin for Denodo - Standalone.zip    Apache Zeppelin for Denodo installer (version must be 20210113)
    denodo.lic                                     Denodo Standalone license
    denodo-install-8.0-ga-linux64.zip              Denodo Platform installer for Linux
    denodo-v80-update-20210209.zip                 Denodo Platform update 8.0-20210209
    denodo-systemd-services-release-20210408.tar   Tar archive with the setup for the Denodo systemd services (already downloaded in step 3)
    dstutorial-release-20210428.zip                Zip archive of the dstutorial repository (already downloaded in step 3)
    mysql-connector-java-8.0.20.zip                MySQL Connector for Java, to be downloaded from here

  6. Create a VirtualBox Host-only network adapter with IPv4 address/mask 192.168.140.1/24, unless one is already defined with this address and mask.
    • In VirtualBox, go to File -> Host Network Manager, then define a new Host-only network adapter.
      Adapter
      - Name: VirtualBox Host-Only Ethernet Adapter
      - IPv4 Address: 192.168.140.1
      - IPv4 Network Mask: 255.255.255.0
      DHCP Server
      - Enable Server: no

    • If you want to use an existing Host-only network adapter, you will need to add the property name and change the property ip for config.vm.network in the file vagrant/Vagrantfile accordingly. Networking configuration documentation is available here.

  7. Navigate to the folder vagrant (under dstutorial-release-20210428). It contains a file called Vagrantfile, which is the configuration file for the provisioning.
  8. Open a command prompt and run:

    vagrant up > provisioning.log

    This command writes the provisioning output to the file provisioning.log, so that you can check that everything is going as expected. The redirection works in both the Git Bash terminal and the Windows Command Prompt (CMD). The provisioning process lasts between 20 and 35 minutes, depending on your hardware and Internet bandwidth. If you open the VirtualBox Manager during the provisioning, you'll find a machine called denodo.dstutorial.com being created and booted.

  9. When the vagrant up command ends, it returns the cursor. Check that the tail of provisioning.log contains the following lines:

    default: check Meter Readings db (postgresql): Product Name : PostgreSQL
    default: Product Version : 12.6 (Ubuntu 12.6-0ubuntu0.20.04.1)
    default: check Weather db (mysql): Product Name : MySQL
    default: Product Version : 8.0.23-0ubuntu0.20.04.1
    default: check Building location (xlsx over sftp): OK
    default: check Holidays data sources (web services): OK
    default: check file GHCN file (csv over sftp) site0_daily: OK
    default: check file GHCN file (csv over sftp) site2_daily: OK
    default: check file GHCN file (csv over sftp) site4_daily: OK
    default: check file GHCN file (csv over sftp) site13_daily: OK
    default: check file GHCN file (csv over sftp) site15_daily: OK
    default: check file GHCN file (csv over sftp) site0_monthly: OK
    default: check file GHCN file (csv over sftp) site2_monthly: OK
    default: check file GHCN file (csv over sftp) site4_monthly: OK
    default: check file GHCN file (csv over sftp) site13_monthly: OK
    default: check file GHCN file (csv over sftp) site15_monthly: OK
    default: check final prediction web service on dstutorial_sample_completed (v1): 200
    default: check final prediction web service on dstutorial_sample_completed (v2): 200
    default: This machine has IP: 192.168.140.100
    default: You may want to add it, to your local hosts file with name denodo.dstutorial.com
    default: provisioning started: Thu Apr  8 11:09:54 UTC 2021
    default: provisioning ended: Thu Apr  8 11:43:14 UTC 2021

  10. Add an entry for the virtual machine IP to the hosts file. In Windows the file is C:\Windows\System32\drivers\etc\hosts:

    ## Machine hosting the Data Science Tutorial Environment
    192.168.140.100		denodo.dstutorial.com

    The IP must match the one specified in config.vm.network in the Vagrantfile.
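
The manual check of the provisioning log tail can also be scripted. The sketch below is an illustration under stated assumptions: check_provisioning_log is a hypothetical helper, and the marker strings are copied from the expected log tail shown above, so they would need adjusting if a later release prints different messages.

```shell
# Returns success when the log contains the final checks of a good provisioning.
# Prints every expected marker that is not found.
# Argument: log file (defaults to provisioning.log in the current folder).
check_provisioning_log() {
  local log="${1:-provisioning.log}"
  local status=0
  local marker
  for marker in \
    "check Building location (xlsx over sftp): OK" \
    "check Holidays data sources (web services): OK" \
    "dstutorial_sample_completed (v1): 200" \
    "dstutorial_sample_completed (v2): 200" \
    "provisioning ended:"
  do
    # -F matches the marker literally, so parentheses need no escaping.
    if ! grep -qF -- "$marker" "$log"; then
      echo "not found in $log: $marker"
      status=1
    fi
  done
  return "$status"
}

# Usage, from the folder vagrant once `vagrant up` has returned:
#   check_provisioning_log && echo "provisioning looks complete"
```

The markers are matched as substrings, so the check should also work if the Windows Command Prompt wrote the log with CRLF line endings.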

At this point you can already connect to the Denodo applications deployed in the VM.
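
Before opening any of the applications, it may be worth confirming that the hosts file really maps the VM address to denodo.dstutorial.com. This is a minimal sketch: has_hosts_entry is a hypothetical helper, and the Git Bash path in the usage comment is an assumption derived from the Windows path given in the steps.

```shell
# Returns success when the hosts file maps the given IP to the given host name.
# Comment lines are skipped; fields may be separated by tabs or spaces.
# Arguments: hosts file, IP address, host name.
has_hosts_entry() {
  awk -v ip="$2" -v name="$3" '
    { sub(/\r$/, "") }              # tolerate Windows (CRLF) line endings
    /^[ \t]*#/ { next }             # ignore comment lines
    $1 == ip { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Usage from Git Bash on Windows:
#   has_hosts_entry /c/Windows/System32/drivers/etc/hosts \
#     192.168.140.100 denodo.dstutorial.com && echo "hosts entry present"
```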

If you want to start again from a clean virtual machine, you should:

  1. Turn off your VirtualBox guest.
  2. Navigate to the folder vagrant.
  3. Run vagrant destroy -f. This command erases your VirtualBox guest.
  4. Run vagrant up > provisioning.log.
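
The reset steps above could be wrapped in a small helper. This is a sketch, not something shipped with the tutorial: reprovision is a hypothetical name, and using vagrant halt for the first step is an assumption (the steps only say to turn the guest off, which can also be done from the VirtualBox Manager).

```shell
# Recreates the VM from scratch.
# Argument: folder that contains the Vagrantfile (defaults to the current one).
reprovision() {
  (
    # The subshell keeps the caller's working directory unchanged.
    cd "${1:-.}" || exit 1
    vagrant halt                    # assumption: shuts the guest down via Vagrant
    vagrant destroy -f              # erases the VirtualBox guest without asking
    vagrant up > provisioning.log   # provisions again, logging to a file
  )
}

# Usage, from the folder dstutorial-release-20210428:
#   reprovision vagrant
```

Note that vagrant destroy -f discards every modification made inside the VM, so anything worth keeping should be exported first.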

Appendix

Useful Technical Information

  • The guest OS is Ubuntu 20.04.2 LTS.
  • If you need to access the virtual machine via ssh/sftp, you can use the user denusr with password denusr. This is an administrator (sudo) user.
  • Denodo installation is under /opt/denodo/8.0.
  • All the services, including the Denodo ones, are deployed as systemd services and are set up to start automatically at boot time. The user that runs the services is denusr.

    Service name                  Systemd unit file
    Virtual DataPort Server       denodo_vdp
    Web Design Studio             denodo_design_studio
    Data Catalog                  denodo_data_catalog
    Apache Zeppelin for Denodo    zeppelin
    ML Web Service                flask_mlpred_rest
    Holiday Web Service           flask_holiday_rest

    Alongside these run the services for the data sources (Postgres, MySQL and MongoDB), which are installed from the standard distribution software repositories.
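
To see at a glance whether the services listed above are running, their state can be queried from a shell inside the VM. The sketch below assumes that the unit names are exactly those in the list; service_states is a hypothetical helper name.

```shell
# Prints the state of each tutorial service as reported by systemd.
# Run inside the VM, for example after logging in over ssh as denusr.
service_states() {
  local unit state
  for unit in denodo_vdp denodo_design_studio denodo_data_catalog \
              zeppelin flask_mlpred_rest flask_holiday_rest; do
    # is-active exits non-zero for stopped units, so the status is ignored
    # and only the printed state is used.
    state=$(systemctl is-active "$unit" 2>/dev/null) || true
    echo "$unit: ${state:-unknown}"
  done
}

# Usage:
#   service_states
```

A unit that is not reported as active can be inspected further with systemctl status followed by the unit name.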

Troubleshooting

  • If you get an error when running vagrant up with code VERR_INTNET_FLT_IF_NOT_FOUND, please refer to this StackOverflow solution.