Imagine your business has tens (or even hundreds) of stores spread across multiple locations. Regularly predicting sales for each store can be a challenging task.
This solution template demonstrates how to use TIM with a panel dataset: we will build one general model for all stores. If you are more interested in modeling each store individually, check the solution template Store sales forecasting.
import requests
import pandas as pd
import numpy as np
import time
import datetime as dt
import json
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
from tim import Tim
pd.set_option('mode.chained_assignment', None)
Credentials and login
(Do not forget to fill in your credentials in the credentials.json file)
with open('credentials.json') as f:
    tim_credentials = json.load(f) # loading the credentials from credentials.json
TIM_URL = 'https://tim-platform.tangent.works/api/v5' # URL to which the requests are sent
tim_client = Tim(email = tim_credentials['email'], password = tim_credentials['password'], endpoint = TIM_URL)
The dataset contains weekly data from 45 stores from 2010-01-30 until 2012-10-20 (143 samples per store). The target variable is weekly sales, and the dataset contains the following additional predictors: an indicator of whether the calendar week contains a holiday, temperature, fuel price, CPI, unemployment rate, size of the store, type of the store (A, B, C), and a special-holiday indicator.
Data are sampled weekly.
Column name | Description | Type | Availability |
---|---|---|---|
Store | The store number | Group key column | |
Date | Timestamp | Timestamp column | |
Weekly_Sales | Weekly sales of the given store | Target | t+0 |
IsHoliday | Binary value (0 or 1) indicating whether the calendar week contains a holiday | Predictor | t+4 |
Temperature | Average temperature in the region | Predictor | t+4 |
Fuel_Price | Cost of fuel in the region | Predictor | t+4 |
CPI | The consumer price index | Predictor | t+4 |
Unemployment | The unemployment rate | Predictor | t+4 |
Size | Size of store | Predictor | t+4 |
isA | Binary indicator type of store A | Predictor | t+4 |
isB | Binary indicator type of store B | Predictor | t+4 |
isC | Binary indicator type of store C | Predictor | t+4 |
SpecialHoliday | Binary indicator of whether the week precedes Christmas or Black Friday | Predictor | t+4 |
Our goal is to predict the next four sales samples (4 weeks). The alignment of the last observations for each store indicates that all predictors are available throughout the whole prediction horizon (we verify this on the loaded data below).
The dataset comes from Kaggle. In the original data, each store contains weekly sales for individual departments; since the department-level data are gappy, we aggregated them to the store level.
The CSV file used in this experiment can be downloaded here.
data = pd.read_csv('panel_data_retail.csv', sep=',') # loading data from panel_data_retail.csv
display(data.head()) # Quick look at the data
data.tail()
Store | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | CPI | Unemployment | Size | isA | isB | isC | SpecialHoliday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2010-01-30 | 1643690.90 | 0 | 42.31 | 2.572 | 211.096358 | 8.106 | 151315 | 1 | 0 | 0 | 0 |
1 | 1 | 2010-02-06 | 1641957.44 | 1 | 38.51 | 2.548 | 211.242170 | 8.106 | 151315 | 1 | 0 | 0 | 0 |
2 | 1 | 2010-02-13 | 1611968.17 | 0 | 39.93 | 2.514 | 211.289143 | 8.106 | 151315 | 1 | 0 | 0 | 0 |
3 | 1 | 2010-02-20 | 1409727.59 | 0 | 46.63 | 2.561 | 211.319643 | 8.106 | 151315 | 1 | 0 | 0 | 0 |
4 | 1 | 2010-02-27 | 1554806.68 | 0 | 46.50 | 2.625 | 211.350143 | 8.106 | 151315 | 1 | 0 | 0 | 0 |
Store | Date | Weekly_Sales | IsHoliday | Temperature | Fuel_Price | CPI | Unemployment | Size | isA | isB | isC | SpecialHoliday | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6430 | 45 | 2012-09-22 | 713173.95 | 0 | 64.88 | 3.997 | 192.013558 | 8.684 | 118221 | 0 | 1 | 0 | 0 |
6431 | 45 | 2012-09-29 | NaN | 0 | 64.89 | 3.985 | 192.170412 | 8.667 | 118221 | 0 | 1 | 0 | 0 |
6432 | 45 | 2012-10-06 | NaN | 0 | 54.47 | 4.000 | 192.327265 | 8.667 | 118221 | 0 | 1 | 0 | 0 |
6433 | 45 | 2012-10-13 | NaN | 0 | 56.47 | 3.969 | 192.330854 | 8.667 | 118221 | 0 | 1 | 0 | 0 |
6434 | 45 | 2012-10-20 | NaN | 0 | 58.85 | 3.882 | 192.308899 | 8.667 | 118221 | 0 | 1 | 0 | 0 |
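As a quick sanity check of the alignment described above, the following snippet (an illustrative addition, not part of the original workflow) confirms that every store ends with exactly four missing target values, i.e. the 4-week prediction horizon:
# Illustrative check: each store should end with 4 missing Weekly_Sales values
# (the 4-week prediction horizon) while the predictors remain populated.
missing_tail = data.sort_values('Date').groupby('Store')['Weekly_Sales'].apply(lambda s: int(s.tail(4).isna().sum()))
print(missing_tail.value_counts()) # expect a single bucket: 4 missing samples for all 45 stores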
To work with panel data, we have to specify group keys when uploading the data; in our case, this is only the column Store. We must also set the timestamp column explicitly, since it is the second column in our data, while TIM assumes the first column by default.
data_configuration = { # Upload dataset configuration
    'name': 'panelDataRetail',
    'timestampColumn': 2,
    'groupKeys': ['Store']
}
upload_results = tim_client.upload_dataset(dataset = data, configuration = data_configuration) # Upload dataset via python client
dataset_id = upload_results.dataset['id'] # Read dataset id
version_id = upload_results.dataset['latestVersion']['id'] # Read dataset version id
print("Dataset id: ", dataset_id) # Display dataset id and dataset version id
print("Dataset version id: ", version_id)
Dataset id: f212fc52-6502-4878-850e-263577803630 Dataset version id: b9f10778-08f1-48e2-8ab6-a09c40281c05
store = 1
namesPred = data.columns[3:]
v_data = data[data.Store == store]
v_data = v_data.sort_values("Date")
fig = go.Figure(go.Scatter(x=v_data.Date, y=v_data.Weekly_Sales, name='Weekly_Sales', line=dict(color='black')))
for p in namesPred:
    fig.add_trace(go.Scatter(x=v_data['Date'], y=v_data[p], name=p))
fig.update_layout(height=500, width=1000, title_text=f"Data visualization store {store}")
fig.show()
We will omit three stores (1, 3, 6) from the training data, for two reasons. First, we would like to demonstrate zero-history models: if a group has short or no history, we can still forecast it with a model trained on similar datasets. Second, a model does not have to be trained on all groups to be accurate. If the amount of data and generated features were to exceed the worker's available RAM, TIM would discard the oldest observations and might switch off polynomial features. Training only on a subset shortens the model-building process and may improve accuracy.
We will set predictionTo to 4 samples ahead and rollingWindow to 1 sample, so that for each timestamp we obtain a prediction of every quality (S+1 to S+4), each made from a different forecast origin. The out-of-sample period will be 2011-11-19 to 2012-10-20, i.e. the last 45 samples counted back from the last target timestamp; the rest will be in-sample. We will simulate short-history models, so we will allow the model to look back only 8 samples (weeks): we will exclude moving averages from the features and set offsetLimit to -10.
outOfSampleRows = 45 # Use last 45 samples as out of sample data
predictionTo = 4 # Prediction horizon 4 samples
outOfSampleStores = [1,3,6] # Stores removed from training
inSampleStores = list(set([*range(1, 46)])-set(outOfSampleStores)) # Stores on which the model is build
job_configuration = { # Forecast job (model building) configuration
    "configuration": {
        "predictionTo": {"baseUnit": "Sample", "value": predictionTo},
        "rollingWindow": {"baseUnit": "Sample", "value": 1},
        "offsetLimit": {"value": -10},
        "features": ["Periodic", "Intercept", "PiecewiseLinear", "TimeOffsets", "Polynomial", "Identity", "Month", "Fourier"]
    },
    "data": {
        "outOfSampleRows": {"baseUnit": "Sample", "value": outOfSampleRows}, # Define the out-of-sample period; the rest is used as in-sample
        "preprocessors": [{
            "type": "CategoryFilter", # Train only on the selected stores
            "value": {
                "column": "Store",
                "categories": inSampleStores
            }
        }]
    }
}
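For intuition, the CategoryFilter preprocessor above corresponds to the following pandas filter (an illustrative sketch of the selection, not something executed client-side):
# Illustrative pandas equivalent of the CategoryFilter preprocessor:
# keep only the rows whose Store value is in inSampleStores.
train_view = data[data['Store'].isin(inSampleStores)]
print(train_view['Store'].nunique()) # 42 stores remain for training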
build_model_results = tim_client.build_forecasting_model_and_execute(dataset_id, job_configuration)
build_job_id = build_model_results.metadata['id']
resultsTableBuild = build_model_results.table_result
resultsTableBuild.head()
Store | timestamp | date_from | time_from | target | forecast | forecast_type | relative_distance | model_index | samples_ahead | lower_bound | upper_bound | bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 2010-01-30T00:00:00.000Z | 2010-01-02 | 00:00:00.0 | 2136989.46 | NaN | InSample | D+28 | NaN | 4 | NaN | NaN | S+4:S+4 |
1 | 2 | 2010-01-30T00:00:00.000Z | 2010-01-09 | 00:00:00.0 | 2136989.46 | NaN | InSample | D+21 | NaN | 3 | NaN | NaN | S+3:S+3 |
2 | 2 | 2010-01-30T00:00:00.000Z | 2010-01-16 | 00:00:00.0 | 2136989.46 | NaN | InSample | D+14 | NaN | 2 | NaN | NaN | S+2:S+2 |
3 | 2 | 2010-01-30T00:00:00.000Z | 2010-01-23 | 00:00:00.0 | 2136989.46 | NaN | InSample | D+7 | NaN | 1 | NaN | NaN | S+1:S+1 |
4 | 2 | 2010-02-06T00:00:00.000Z | 2010-01-09 | 00:00:00.0 | 2137809.50 | NaN | InSample | D+28 | NaN | 4 | NaN | NaN | S+4:S+4 |
The endpoint to register a predict job was not implemented in the Python client at the time of writing this notebook. Therefore, we will first authorize the user and then call the API directly. We will register the predict job using the model from the build job with id build_job_id.
# authorization
auth_url = TIM_URL+'/auth/login'
auth_response = requests.request(method="POST", url=auth_url, json=tim_credentials).json()
auth_token = auth_response['token']
# predict job configuration
predict_job_configuration = {
    "configuration": {
        "predictionTo": {"baseUnit": "Sample", "value": predictionTo},
        "rollingWindow": {"baseUnit": "Sample", "value": 1},
    },
    "data": {
        "version": {"id": version_id},
        "outOfSampleRows": {"baseUnit": "Sample", "value": outOfSampleRows},
        "preprocessors": [{
            "type": "CategoryFilter", # Predict only for the held-out stores
            "value": {
                "column": "Store",
                "categories": outOfSampleStores
            }
        }]
    }
}
# registration of predict job
predict_url = TIM_URL+'/forecast-jobs/'+ build_job_id +'/predict'
predict_register_response = requests.request(method="POST", url=predict_url, headers={'Authorization':'Bearer '+auth_token},json=predict_job_configuration).json()
predict_job_id = predict_register_response['id']
# Execute the job registered under predict_job_id and wait for the results
predict_response = tim_client.execute_forecast(forecast_job_id = predict_job_id, wait_to_finish = True)
resultsTablePredict = predict_response.table_result
resultsTablePredict.head()
Store | timestamp | date_from | time_from | target | forecast | forecast_type | relative_distance | model_index | samples_ahead | lower_bound | upper_bound | bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2011-11-19T00:00:00.000Z | 2011-10-22 | 00:00:00.0 | 2033320.66 | 2.144742e+06 | OutOfSample | D+28 | 4 | 4 | 2.034620e+06 | 2.272719e+06 | S+4:S+4 |
1 | 1 | 2011-11-19T00:00:00.000Z | 2011-10-29 | 00:00:00.0 | 2033320.66 | 2.096524e+06 | OutOfSample | D+21 | 3 | 3 | 1.990232e+06 | 2.214860e+06 | S+3:S+3 |
2 | 1 | 2011-11-19T00:00:00.000Z | 2011-11-05 | 00:00:00.0 | 2033320.66 | 2.226591e+06 | OutOfSample | D+14 | 2 | 2 | 2.127554e+06 | 2.336409e+06 | S+2:S+2 |
3 | 1 | 2011-11-19T00:00:00.000Z | 2011-11-12 | 00:00:00.0 | 2033320.66 | 2.258263e+06 | OutOfSample | D+7 | 1 | 1 | 2.161485e+06 | 2.371231e+06 | S+1:S+1 |
4 | 1 | 2011-11-26T00:00:00.000Z | 2011-10-29 | 00:00:00.0 | 1584083.95 | 1.888743e+06 | OutOfSample | D+28 | 4 | 4 | 1.778621e+06 | 2.016720e+06 | S+4:S+4 |
colorsOOS = ['red', 'palevioletred', 'orchid', 'violet']
for i in outOfSampleStores:
    v_data = resultsTablePredict[resultsTablePredict['Store'] == i].sort_values('timestamp')
    production = v_data['forecast_type'] == 'Production'
    fig = go.Figure(go.Scatter(x=v_data['timestamp'][production], y=v_data['forecast'][production], name='Prediction', line=dict(color='goldenrod')))
    for s in range(1, 5):
        outOfSample = (v_data['forecast_type'] == 'OutOfSample') & (v_data['samples_ahead'] == s)
        vis = True if s == 1 else 'legendonly' # show only S+1 by default
        fig.add_trace(go.Scatter(x=v_data['timestamp'][outOfSample], y=v_data['forecast'][outOfSample], name=f'OutOfSample S+{s}', line=dict(color=colorsOOS[s-1]), visible=vis))
    fig.add_trace(go.Scatter(x=v_data['timestamp'], y=v_data['target'], name='Actuals', line=dict(color='black')))
    fig.update_layout(height=500, width=1000, title_text=f"Results Store {i}")
    fig.show()
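TIM reports accuracy metrics itself (we use them at the end of this notebook), but as a cross-check we can also compute the out-of-sample MAPE by hand from the results table; a minimal sketch:
# Illustrative cross-check: manual MAPE per store and horizon from the prediction table.
oos = resultsTablePredict[resultsTablePredict['forecast_type'] == 'OutOfSample'].dropna(subset=['target', 'forecast'])
ape = (oos['target'] - oos['forecast']).abs() / oos['target'].abs() * 100
print(ape.groupby([oos['Store'], oos['samples_ahead']]).mean().head())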
colorsOOS = ['red', 'palevioletred', 'orchid', 'violet']
colorsIS = ['green', 'springgreen', 'turquoise', 'limegreen']
for i in inSampleStores[:3]:
    v_data = resultsTableBuild[resultsTableBuild['Store'] == i].sort_values('timestamp')
    production = v_data['forecast_type'] == 'Production'
    fig = go.Figure(go.Scatter(x=v_data['timestamp'][production], y=v_data['forecast'][production], name='Prediction', line=dict(color='goldenrod')))
    for s in range(1, 5):
        outOfSample = (v_data['forecast_type'] == 'OutOfSample') & (v_data['samples_ahead'] == s)
        inSample = (v_data['forecast_type'] == 'InSample') & (v_data['samples_ahead'] == s)
        vis = True if s == 1 else 'legendonly' # show only S+1 by default
        fig.add_trace(go.Scatter(x=v_data['timestamp'][outOfSample], y=v_data['forecast'][outOfSample], name=f'OutOfSample S+{s}', line=dict(color=colorsOOS[s-1]), visible=vis))
        fig.add_trace(go.Scatter(x=v_data['timestamp'][inSample], y=v_data['forecast'][inSample], name=f'InSample S+{s}', line=dict(color=colorsIS[s-1]), visible=vis))
    fig.add_trace(go.Scatter(x=v_data['timestamp'], y=v_data['target'], name='Actuals', line=dict(color='black')))
    fig.update_layout(height=500, width=1000, title_text=f"Results Store {i}")
    fig.show()
properties = build_model_results.model_result['model']['Model Zoo']['variableProperties']
properties_df = pd.DataFrame(properties).sort_values(by='importance',ascending=False)
properties_df['rel_importance'] = properties_df['importance'] / properties_df['importance'].sum() # normalize importances to sum to 1
fig = px.bar(properties_df, x="name", y="rel_importance", color="name")
fig.update_layout(height=500, width=1000, title_text="Variable Importances")
fig.show()
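In this setup, the Model Zoo holds one model per step of the prediction horizon (the model_index column in the result tables refers to these). A quick, illustrative look at its structure:
# Illustrative: list the models in the Model Zoo and how many terms each uses.
zoo = build_model_results.model_result['model']['Model Zoo']
for m in zoo['models']:
    print(f"Model index {m['index']}: {len(m['terms'])} terms")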
# Translate a model term's parts into a human-readable feature string and its β coefficient
def find_feature(sub_parts):
    dow_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    features_list = []
    beta = None
    for c, s in enumerate(sub_parts):
        if s['type'] == 'β':
            sub_feature = ''
        elif s['type'] == 'TimeOffsets':
            sub_feature = s['predictor'] + '(t' + str(s['offset']) + ')'
        elif s['type'] == 'RestOfWeek':
            sub_feature = 'DoW(t' + str(s['offset']) + ') <= ' + dow_list[s['day'] - 1]
        elif s['type'] == 'Intercept':
            sub_feature = 'Intercept(' + str(int(s['value'])) + ')'
        elif s['type'] == 'Cos':
            sub_feature = 'Cos(' + str(int(s['period'])) + ';' + s['unit'] + ')'
        elif s['type'] == 'Sin':
            sub_feature = 'Sin(' + str(int(s['period'])) + ';' + s['unit'] + ')'
        elif s['type'] == 'ExponentialMovingAverage':
            sub_feature = 'EMA_' + s['predictor'] + '(t' + str(int(s['offset'])) + '; w=' + str(int(s['window'])) + ')'
        elif s['type'] == 'Identity':
            sub_feature = s['predictor']
        elif s['type'] == 'PiecewiseLinear':
            sub_feature = 'max(0;' + str(s['subtype']) + '*(' + str(s['knot']) + '-' + s['predictor'] + '(t' + str(s['offset']) + ')))'
        elif s['type'] == 'SimpleMovingAverage':
            sub_feature = 'SMA_' + s['predictor'] + '(t' + str(int(s['offset'])) + '; w=' + str(int(s['window'])) + ')'
        elif s['type'] == 'Fourier':
            sub_feature = 'Fourier(' + str(s['period']) + ')'
        elif s['type'] == 'Weekday':
            sub_feature = '_test_Weekday_'
        elif s['type'] == 'Month':
            sub_feature = f"Month({s['month']})"
        elif s['type'] == 'PublicHoliday':
            sub_feature = 'PublicHoliday'
        elif s['type'] == 'Trend':
            sub_feature = 'Trend'
        else:
            sub_feature = '_test_'
        if s['type'] == 'β':
            features_list.append(sub_feature)
            beta = s['value']
        else:
            features_list.append((' & ' if c > 0 else '') + sub_feature)
    feature_output = ''.join(str(e) for e in features_list)
    return feature_output, beta
features = []
for m in build_model_results.model_result['model']['Model Zoo']['models']:
    terms = m['terms']
    for count, t in enumerate(terms):
        f, b = find_feature(t['parts'])
        features.append([m['index'], count, f, t['importance'], b])
features_df = pd.DataFrame(features, columns=['Model', 'Term', 'Feature', 'importance', 'beta'])
features_df
fig = px.sunburst(features_df, path=['Model','Feature'], values='importance',color='Feature')
fig.update_layout(height=700,width=700,title_text='Feature Importances')
fig.show()
accuracies = np.zeros((45, 4)) # one row per store, one column per horizon S+1..S+4
metric = 'mape'
for acc in predict_response.accuracies['groups']['samplesAhead']: # zero-history stores (1, 3, 6)
    accuracies[int(acc['groupKeys']) - 1, int(acc['name']) - 1] = acc['outOfSample'][metric]
for acc in build_model_results.accuracies['groups']['samplesAhead']: # stores used for training
    accuracies[int(acc['groupKeys']) - 1, int(acc['name']) - 1] = acc['outOfSample'][metric]
accuracies_df = pd.DataFrame(accuracies, columns=['S+1', 'S+2', 'S+3', 'S+4'])
accuracies_df['Store'] = range(1,46)
for i in [1, 2, 3, 4]:
    fig = px.bar(accuracies_df, x="Store", y=f"S+{i}", color="Store")
    fig.update_layout(height=500, width=900, title_text=f"Accuracies {metric} S+{i}")
    fig.show()
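Finally, to see how the three zero-history stores fare against the stores the model was trained on, we can average the per-horizon errors; a minimal sketch:
# Illustrative: compare the average out-of-sample MAPE of the zero-history
# stores (1, 3, 6) with that of the stores used for training.
is_zero_history = accuracies_df['Store'].isin(outOfSampleStores)
print('Zero-history stores:\n', accuracies_df.loc[is_zero_history, ['S+1', 'S+2', 'S+3', 'S+4']].mean())
print('Trained stores:\n', accuracies_df.loc[~is_zero_history, ['S+1', 'S+2', 'S+3', 'S+4']].mean())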