The data used in this tutorial is distributed across several data sources. The table below gives an overview of the data sources, their approximate sizes, and their formats.
The time-labeled target variable, meter_reading, is in the PostgreSQL training dataset. All remaining data assets will be used to extract and engineer features that serve as predictors for the learning algorithm:
| Data asset | Source | Approximate size | Format | Endpoint |
|---|---|---|---|---|
| Training data (contains the target meter_reading) | contest | ~ 20M rows | PostgreSQL | |
| Test data | contest | ~ 42M rows | | |
| Weather for test data | contest | 19MB (~ 280k rows) | | |
| Weather for training data | contest | 10MB | | |
| Site-to-geolocation mapping | contest forum | 11KB | XLSX over SFTP | sftp.dstutorial.com |
| Daily and monthly weather data from stations close to the site | National Centers for Environmental Information | 20MB | CSV over SFTP | sftp.dstutorial.com |
| Time-related features (month, quarter, year, ...) | TPC-DS Kit | 10MB | PostgreSQL | postgres.dstutorial.com |
| Whether a day is a holiday in a given location | generated from the PyPI Holidays project | ~ 10KB | Web service | ws.dstutorial.com |
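To make the time-related features concrete, here is a minimal sketch of how month, quarter, and year can be derived for a given day. The function name and the exact feature set are our own illustration; the actual TPC-DS date dimension is a much richer table.

```python
from datetime import date

def time_features(d: date) -> dict:
    # Calendar features of the kind the date dimension provides:
    # month, quarter (1-4), and year for a given day.
    return {
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
    }

print(time_features(date(2016, 11, 15)))  # {'month': 11, 'quarter': 4, 'year': 2016}
```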
The site locations have been obtained from here. They were not provided by the competition organizers but inferred by the competitors. We are going to include only those sites that were labeled as confirmed.
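Keeping only the confirmed sites amounts to a simple filter over the mapping. The rows and column names below are hypothetical stand-ins for the workbook's actual layout, which may differ:

```python
# Hypothetical rows mimicking the site-to-geolocation mapping workbook.
site_locations = [
    {"site_id": 0, "location": "city A", "status": "confirmed"},
    {"site_id": 1, "location": "city B", "status": "tentative"},
    {"site_id": 2, "location": "city C", "status": "confirmed"},
]

# Keep only the sites whose inferred location was labeled as confirmed.
confirmed = [row for row in site_locations if row["status"] == "confirmed"]
print([row["site_id"] for row in confirmed])  # [0, 2]
```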
Weather data has been obtained by taking the data from all the stations in the site's county, at both the monthly and the daily level. The data has been downloaded from here (Climate Data Online - CDO) and, for site 2, from here. The two links appear to serve the same GHCN (Global Historical Climatology Network) data.
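The county-level aggregation described above can be sketched as follows: for each day, average a reading over all stations in the site's county. The record layout and field names are assumptions for illustration, not the actual GHCN schema:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily records: (station_id, date, avg_temperature_celsius).
records = [
    ("ST001", "2016-01-01", 5.0),
    ("ST002", "2016-01-01", 7.0),
    ("ST001", "2016-01-02", 4.0),
]

# Group readings by day across all stations in the county...
by_day = defaultdict(list)
for _station, day, temp in records:
    by_day[day].append(temp)

# ...then take the per-day mean as the county-level value.
county_daily = {day: mean(temps) for day, temps in by_day.items()}
print(county_daily)  # {'2016-01-01': 6.0, '2016-01-02': 4.0}
```

The same grouping applied to month keys instead of day keys yields the monthly aggregates.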