Data Description and Location

Data Science and Machine Learning

The data used in the tutorial is distributed across several data sources.

The table below gives an overview of the data sources, their approximate size and their format.

The time-labeled target variable, meter_reading, is in the PostgreSQL training dataset.

All remaining data assets will be used to engineer features that serve as predictors for the learning algorithm:

  • building metadata (size, number of floors, use ...)
  • weather data at the hourly, daily and monthly levels, each with different indicators
  • holidays

Description | Origin | Size | Source DB Host
Training dataset | contest | 1GB, ~20M rows |
Test dataset | contest | 3GB, ~42M rows |
Building metadata | contest | 45KB | MongoDB/JSON
Weather for test data | contest | 19MB, ~280K rows |
Weather for training data | contest | 10MB, ~140K rows |
Site-to-geolocation mapping [1] | contest forum | 11KB | XLSX over SFTP
Daily and monthly weather data from stations close to the site [2] | National Centers for Environmental Information | 20MB | CSV over SFTP
Provides time-related features (month, quarter, year, ...) | TPC-DS Kit | 10MB | PostgreSQL
Whether a day is a holiday in a given location | Generated from the PyPI Holidays project | ~10KB | Web service
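
To make the last two rows of the table more concrete, the sketch below shows what time-related features (month, quarter, year, ...) and a holiday flag could look like for a single reading timestamp. It is only an illustration under stated assumptions: in the tutorial these values come from the PostgreSQL date table and the holidays web service, whereas here they are computed locally with the Python standard library. The function name time_features, the feature names and the hard-coded holiday dates are all hypothetical.

```python
from datetime import date, datetime

# Hypothetical holiday calendar, hard-coded for illustration only.
# In the tutorial this information is served by a web service.
EXAMPLE_HOLIDAYS = frozenset({date(2016, 1, 1), date(2016, 12, 25)})


def time_features(ts, holidays=EXAMPLE_HOLIDAYS):
    """Derive calendar features from a single reading timestamp."""
    return {
        "year": ts.year,
        "quarter": (ts.month - 1) // 3 + 1,  # 1..4
        "month": ts.month,
        "day_of_week": ts.weekday(),  # Monday == 0
        "hour": ts.hour,
        "is_holiday": ts.date() in holidays,
    }


features = time_features(datetime(2016, 12, 25, 14, 0))
print(features)
```

Each key in the returned dictionary would become one predictor column once joined to the meter readings on their timestamp.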

[1] The site locations were obtained from here. They were not provided by the competition organizers but inferred by the competitors. Only the sites labeled as confirmed are included.

[2] Weather data was obtained by taking the data from all the stations in the site's county at the monthly and daily levels. The data was downloaded from here (Climate Data Online, CDO) and, for site 2, from here. The two links seem to give the same GHCN (Global Historical Climatology Network) data.
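
Because note [2] collects data from every station in a site's county, a site can have several station records for the same day. One possible way to reduce them to a single value per site and day is to average across stations, as in the sketch below. This is an assumption for illustration, not necessarily what the tutorial does; the record layout, the station identifiers and the values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily station records: (site_id, station_id, day, mean_temp_c).
# Field names and values are illustrative, not taken from the actual files.
station_rows = [
    (2, "ST_A", "2016-01-01", 10.0),
    (2, "ST_B", "2016-01-01", 12.0),
    (2, "ST_A", "2016-01-02", 8.0),
]


def average_by_site_and_day(rows):
    """Average one indicator over all stations reporting for a site and day."""
    grouped = defaultdict(list)
    for site_id, _station_id, day, value in rows:
        grouped[(site_id, day)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


site_daily = average_by_site_and_day(station_rows)
print(site_daily)
```

The same grouping applies to the monthly files by replacing the day with a month key.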