Demand forecasting in retail¶

Title: Demand forecasting in retail
Author: Michal Bezak, Tangent Works
Industry: Retail
Area: Demand
Type: Forecasting
Use Case Library: Open Use Case Library

Description¶

Predicting demand, i.e., how many units of a certain product will be sold at a certain time, is one of the most important tasks in retail. It is linked to costs incurred and to sales opportunities, and thus has a direct impact on P&L. The implications of an incorrect forecast are significant. If there is too much of a particular product in the warehouse, too much capital is tied up in inventory that just sits there and occupies room (and capital) that could have been used elsewhere (or a smaller warehouse might have sufficed). On the other hand, if there is too little, the product runs out of stock too soon, which may translate into lost revenue.

It is not only about the hard numbers, though. Reducing situations where a product is out of stock helps preserve a positive customer experience. Other areas impacted by demand forecasting include:

  • Financial planning (planning budgets, forecasting revenue);
  • Optimization of purchasing process;
  • Marketing campaigns planning;
  • Logistics (delivery);
  • Staff management – planning (forecasting) of resources needed;

Business parameters¶

Demand forecasting can support a variety of use cases; we present two of them as examples.

A

Business objective: Availability of products in demand
Value: Revenue maximization
KPI: Revenue for a given product in a given timeframe


B

Business objective: Minimize turnaround time of items in stock
Value: Maximize value of the warehouse
KPI: Average turnaround time for a given product

In [2]:
import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json
import datetime

import tim_client

Credentials and logging¶

(Do not forget to fill in your credentials in the credentials.json file)
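The keys the file needs follow from how the credentials are read below (`license_key`, `email`, `password`). A minimal sketch that writes a placeholder `credentials.json` (the values are placeholders, not real credentials):

```python
import json

# Placeholder credentials file - replace the values with your own TIM account details
credentials = {
    "license_key": "YOUR_LICENSE_KEY",
    "email": "you@example.com",
    "password": "YOUR_PASSWORD",
}

with open('credentials.json', 'w') as f:
    json.dump(credentials, f, indent=2)
```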

In [3]:
with open('credentials.json') as f:
    credentials_json = json.load(f)                     # loading the credentials from credentials.json

TIM_URL = 'https://timws.tangent.works/v4/api'          # URL to which the requests are sent

SAVE_JSON = False                                        # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/'                            # folder where the requests and responses are stored

LOGGING_LEVEL = 'INFO'
In [4]:
level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)
In [5]:
credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)

api_client.save_json = SAVE_JSON
api_client.json_saving_folder_path = JSON_SAVING_FOLDER
[INFO] 2021-01-15 13:50:24,436 - tim_client.api_client:save_json:66 - Saving JSONs functionality has been disabled
[INFO] 2021-01-15 13:50:24,438 - tim_client.api_client:json_saving_folder_path:75 - JSON destination folder changed to logs

Datasets¶

The data contain units sold for a particular product category (jackets and shorts), enhanced with information about holidays, weather, lockdowns and other factors.

Sampling and gaps¶

The data are sampled on a daily basis and may contain occasional gaps.

Data¶

| Column name | Description | Type | Availability |
| --- | --- | --- | --- |
| DATE | Date | Timestamp column | |
| UNITS | Units sold | Target | t-1 |
| Min_TEMP_C | Min. temperature for a particular day | Predictor | t+2 |
| Max_TEMP_C | Max. temperature for a particular day | Predictor | t+2 |
| Clear | Weather indicator - clear sky | Predictor | t+2 |
| Clouds | Weather indicator - clouds | Predictor | t+2 |
| Drizzle | Weather indicator - drizzle | Predictor | t+2 |
| Fog | Weather indicator - fog | Predictor | t+2 |
| Mist | Weather indicator - mist | Predictor | t+2 |
| Rain | Weather indicator - rain | Predictor | t+2 |
| Snow | Weather indicator - snow | Predictor | t+2 |
| Thunderstorm | Weather indicator - thunderstorm | Predictor | t+2 |
| Christmas | Christmas period indicator (binary) | Predictor | t+2 |
| Black Friday | Black Friday period indicator (binary) | Predictor | t+2 |
| New Year | New Year period indicator (binary) | Predictor | t+2 |
| Peak Period | Most active period indicator (binary) | Predictor | t+2 |
| Lockdown | Lockdown indicator (binary) | Predictor | t+2 |

For predictors with availability t+N where N>0, it is assumed that forecast values are available.
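To illustrate the availability pattern, a toy DataFrame in the same shape TIM expects: the target is known only up to t-1 (so the last two UNITS values are NaN), while a t+2 predictor is filled to the end of the horizon (values here taken from the tail of the jackets dataset shown later):

```python
import numpy as np
import pandas as pd

# Toy dataset mirroring the availability pattern described above:
# the target (UNITS) ends with NaNs, a t+2 predictor is filled to the end.
toy = pd.DataFrame({
    'DATE': pd.date_range('2020-10-01', periods=5, freq='D'),
    'UNITS': [1579.0, 2284.0, 2326.0, np.nan, np.nan],   # target available up to t-1
    'Max_TEMP_C': [22.8, 19.0, 21.1, 25.7, 24.8],        # forecast values available for t+1, t+2
})

print(toy)
print('Rows to forecast:', toy['UNITS'].isna().sum())
```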

Forecasting situation¶

TIM detects the forecasting situation from the current "shape" of the data: e.g., if the last available target value is for Jan 21st, it will start forecasting from Jan 22nd. It also takes the Jan 21st timestamp as the reference point against which the availability of each column in the dataset is determined; this rule is then followed in back-testing when calculating results for the out-of-sample interval.

In our case, all predictor values were available 2 steps ahead.

We want to back-test forecasting of units sold for the day after tomorrow, and will do so every 2 days. Setting the prediction horizon to 2 days ahead will automatically produce out-of-sample values predicted with a rolling window of 2.
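Conceptually, the back-test rolls the forecast origin forward and predicts two samples beyond the last known target each time. A sketch of the resulting (origin, target) pairs for an S+2 horizon (dates are illustrative, not taken from the dataset):

```python
import pandas as pd

dates = pd.date_range('2020-03-09', periods=8, freq='D')   # last known target dates (illustrative)
horizon = 2   # S+2: forecast two samples beyond the last known target

# Each forecast origin yields an out-of-sample prediction `horizon` days ahead
pairs = [(origin.date(), (origin + pd.Timedelta(days=horizon)).date())
         for origin in dates]

for origin, target in pairs[:3]:
    print(f'last known target: {origin} -> forecast for: {target}')
```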

CSV files used in experiments can be downloaded here.

In [6]:
dataset1 ='data_jackets.csv'
dataset2 ='data_shorts.csv'

Product: Jackets¶

In [7]:
data = tim_client.load_dataset_from_csv_file( dataset1 , sep=',')

The NaN values at the end of the target (UNITS) column reflect the state of the data at the time of forecasting: e.g., on October 4th, with the last available data point from October 3rd, we want to predict UNITS for October 5th.

In [8]:
data.tail()
Out[8]:
DATE UNITS Min_TEMP_C Max_TEMP_C Clear Clouds Drizzle Fog Mist Rain Snow Thunderstorm Christmas Black Friday New Year Peak Period Lockdown
688 2020-10-01 1579.0 12.491094 22.814398 36 367 0 0 0 54 0 0 0 0 0 0 0
689 2020-10-02 2284.0 12.933676 18.963764 14 229 0 0 0 214 0 0 0 0 0 0 0
690 2020-10-03 2326.0 12.688293 21.127352 41 349 0 0 0 67 0 0 0 0 0 0 0
691 2020-10-04 NaN 13.179190 25.712166 298 107 0 0 0 52 0 0 0 0 0 0 0
692 2020-10-05 NaN 15.018446 24.809737 36 384 0 0 0 37 0 0 0 0 0 0 0

Visualisation of data.

In [9]:
timestamp_column = 'DATE'
target_column = 'UNITS'
In [10]:
fig = plt.subplots.make_subplots(rows=1, cols=1, shared_xaxes=True, vertical_spacing=0.02)  

fig.add_trace( go.Scatter( x = data.loc[:, timestamp_column ], y=data.loc[:, target_column ], name = target_column, line=dict(color='blue')), row=1, col=1) 

fig.update_layout(height=500, width=1000)                           

fig.show()

Engine settings¶

Parameters that need to be set:

  • Prediction horizon (Sample + 2) - because we want to predict values up to the end of the day after tomorrow.
  • Back-test length.

We also ask the engine for additional data to see details of the sub-models, so we define the extendedOutputConfiguration parameter as well.

30% of the data will be used for the out-of-sample interval.

In [11]:
backtest_length = int( data.shape[0] * .3 )

backtest_length
Out[11]:
207
In [12]:
configuration_backtest = {
    'usage': {                                 
        'predictionTo': { 
            'baseUnit': 'Sample',             
            'offset': 2                         # number of samples we want to predict into the future (2 days in this case)
        },
        'backtestLength': backtest_length       # number of samples that are used for backtesting (note that these samples are excluded from model building period)
    },
    'extendedOutputConfiguration': {
        'returnExtendedImportances': True       # flag that specifies if the importances of features are returned in the response
    }
}

Experiment iteration¶

For both datasets, we will run just one experiment iteration with default settings.

In [13]:
backtest = api_client.prediction_build_model_predict(data, configuration_backtest)     # running the RTInstantML forecasting using data and defined configuration
backtest.status                                                                        # status of the job
Out[13]:
'FinishedWithWarning'
In [14]:
backtest.result_explanations
Out[14]:
[{'index': 1,
  'message': 'Predictor UNITS has a value missing for timestamp 2019-05-01 00:00:00.'}]
In [15]:
out_of_sample_values = backtest.aggregated_predictions[1]['values']
out_of_sample_values.rename( columns = {'Prediction': target_column+'_pred'}, inplace=True)
out_of_sample_timestamps = out_of_sample_values.index.tolist()
In [16]:
evaluation_data = data.copy()

evaluation_data[ timestamp_column ] = pd.to_datetime(data[ timestamp_column ]).dt.tz_localize('UTC')
evaluation_data = evaluation_data[ evaluation_data[ timestamp_column ].isin( out_of_sample_timestamps ) ]

evaluation_data.set_index( timestamp_column, inplace=True)

evaluation_data = evaluation_data[ [ target_column ] ].join( out_of_sample_values )

evaluation_data.head()
Out[16]:
UNITS UNITS_pred
DATE
2020-03-11 00:00:00+00:00 1199.0 1256.67
2020-03-12 00:00:00+00:00 1404.0 1290.55
2020-03-13 00:00:00+00:00 2724.0 2179.27
2020-03-14 00:00:00+00:00 3086.0 3069.69
2020-03-15 00:00:00+00:00 1261.0 1404.71
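The horizon metrics returned later by the engine can be cross-checked by hand. A sketch of MAE/RMSE/MAPE computed directly from an actual-vs-predicted frame, using only the five rows shown above (so the numbers will not match the full back-test metrics):

```python
import numpy as np
import pandas as pd

# First rows of the out-of-sample table above (actual vs predicted units)
df = pd.DataFrame({
    'UNITS':      [1199.0, 1404.0, 2724.0, 3086.0, 1261.0],
    'UNITS_pred': [1256.67, 1290.55, 2179.27, 3069.69, 1404.71],
})

err = df['UNITS_pred'] - df['UNITS']
mae = err.abs().mean()                                  # mean absolute error
rmse = np.sqrt((err ** 2).mean())                       # root mean squared error
mape = (err.abs() / df['UNITS'].abs()).mean() * 100     # mean absolute percentage error

print(f'MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.2f}%')
```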

Insights - inspecting ML models¶

TIM offers a view of which predictors are considered important. Both simple and extended importances are available, showing to what extent each predictor contributes to explaining the variance of the target variable.

In [17]:
simple_importances = backtest.predictors_importances['simpleImportances']
simple_importances = sorted(simple_importances, key = lambda i: i['importance'], reverse=True) 

simple_importances = pd.DataFrame.from_dict( simple_importances )

simple_importances
Out[17]:
importance predictorName
0 84.50 UNITS
1 7.88 Black Friday
2 6.25 Max_TEMP_C
3 1.36 Fog
In [18]:
fig = go.Figure()

fig.add_trace(go.Bar( x = simple_importances['predictorName'],
                      y = simple_importances['importance'] )
             )

fig.update_layout(
        title='Simple importances'
)

fig.show()
In [19]:
extended_importances = backtest.predictors_importances['extendedImportances']
extended_importances = sorted(extended_importances, key = lambda i: i['importance'], reverse=True) 

extended_importances = pd.DataFrame.from_dict( extended_importances )

extended_importances
Out[19]:
time type termName importance
0 [1] Interaction UNITS(t-1) & DoW(t) ≤ Sat 44.41
1 [1] TargetAndTargetTransformation UNITS(t-1) 38.12
2 [2] TargetAndTargetTransformation UNITS(t-7) 25.75
3 [2] Interaction UNITS(t-2) & DoW(t) ≤ Sat 14.61
4 [2] Interaction UNITS(t-2) & sin(2πt / 168.0 hours) 11.66
5 [2] Interaction UNITS(t-2) & cos(2πt / 168.0 hours) 11.33
6 [2] Interaction UNITS(t-2) & DoW(t) ≤ Mon 8.68
7 [2] TargetAndTargetTransformation UNITS(t-2) 8.06
8 [1] Interaction UNITS(t-1) & DoW(t) ≤ Thu 7.51
9 [2] Predictor Max_TEMP_C 6.98
10 [2] Predictor Black Friday 6.43
11 [1] Predictor Black Friday 5.31
12 [1] Interaction UNITS(t-1) & (Max_TEMP_C(t) - 19.99)⁺ 4.66
13 [2] Interaction sin(2πt / 168.0 hours) & Fog 4.06
14 [2] Interaction UNITS(t-7) & UNITS(t-2) 2.43
15 [1] TargetAndTargetTransformation Intercept 0.00
16 [2] TargetAndTargetTransformation Intercept 0.00
In [20]:
fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '[1]' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '[1]' ]['importance'] )
             )

fig.update_layout(
        title='Importances for the 1st day of prediction horizon'
)

fig.show()

fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '[2]' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '[2]' ]['importance'] )
             )

fig.update_layout(
        title='Importances for the 2nd day of prediction horizon'
)

fig.show()

Evaluation of results¶

Results for out-of-sample interval.

In [21]:
fig = go.Figure()

fig.add_trace( go.Scatter(x=evaluation_data.index, y=evaluation_data[ target_column ].values, name='Actual', line_shape='linear'))

fig.add_trace( go.Scatter(x=evaluation_data.index, y=evaluation_data[ target_column+'_pred'].values, name='Predicted', line_shape='linear'))

fig.update_layout( autosize=False, width=1200, height=500 )

fig.show()

Accuracy metrics for the 2nd day.

In [22]:
backtest.aggregated_predictions[1]['accuracyMetricsForHorizons']['S+2']
Out[22]:
{'MAE': 720.6999662587592,
 'MSE': 1365283.018892031,
 'MAPE': 230.38067882398673,
 'RMSE': 1168.4532591815691}

Product: Shorts¶

In [23]:
data = tim_client.load_dataset_from_csv_file( dataset2 , sep=',')

Visualisation of data.

In [24]:
fig = plt.subplots.make_subplots(rows=1, cols=1, shared_xaxes=True, vertical_spacing=0.02)  

fig.add_trace( go.Scatter( x = data.loc[:, timestamp_column ], y=data.loc[:, target_column ], name = target_column, line=dict(color='blue')), row=1, col=1) 

fig.update_layout(height=500, width=1000)                           

fig.show()

Engine settings¶

We will use the very same settings as for the case above.

In [25]:
backtest_length
Out[25]:
207
In [26]:
configuration_backtest
Out[26]:
{'usage': {'predictionTo': {'baseUnit': 'Sample', 'offset': 2},
  'backtestLength': 207},
 'extendedOutputConfiguration': {'returnExtendedImportances': True}}

Experiment iteration¶

In [27]:
backtest = api_client.prediction_build_model_predict(data, configuration_backtest) 
backtest.status  
Out[27]:
'FinishedWithWarning'
In [28]:
backtest.result_explanations
Out[28]:
[{'index': 1,
  'message': 'Predictor UNITS has a value missing for timestamp 2019-05-01 00:00:00.'}]
In [29]:
out_of_sample_values = backtest.aggregated_predictions[1]['values']
out_of_sample_values.rename( columns = {'Prediction': target_column+'_pred'}, inplace=True)
out_of_sample_timestamps = out_of_sample_values.index.tolist()
In [30]:
evaluation_data = data.copy()

evaluation_data[ timestamp_column ] = pd.to_datetime(data[ timestamp_column ]).dt.tz_localize('UTC')
evaluation_data = evaluation_data[ evaluation_data[ timestamp_column ].isin( out_of_sample_timestamps ) ]

evaluation_data.set_index( timestamp_column, inplace=True)

evaluation_data = evaluation_data[ [ target_column ] ].join( out_of_sample_values )

evaluation_data.head()
Out[30]:
UNITS UNITS_pred
DATE
2020-03-11 00:00:00+00:00 2189.0 4496.91
2020-03-12 00:00:00+00:00 2223.0 2454.35
2020-03-13 00:00:00+00:00 3175.0 4708.24
2020-03-14 00:00:00+00:00 4777.0 5927.58
2020-03-15 00:00:00+00:00 2875.0 3076.07

Insights - inspecting ML models¶

In [31]:
simple_importances = backtest.predictors_importances['simpleImportances']
simple_importances = sorted(simple_importances, key = lambda i: i['importance'], reverse=True) 

simple_importances = pd.DataFrame.from_dict( simple_importances )

simple_importances
Out[31]:
importance predictorName
0 55.16 UNITS
1 20.42 Peak Period
2 7.83 Black Friday
3 7.61 Christmas
4 7.07 Max_TEMP_C
5 1.39 Clouds
6 0.52 Min_TEMP_C
In [32]:
fig = go.Figure()

fig.add_trace(go.Bar( x = simple_importances['predictorName'],
                      y = simple_importances['importance'] )
             )

fig.update_layout(
        title='Simple importances'
)

fig.show()
In [33]:
extended_importances = backtest.predictors_importances['extendedImportances']    # refresh importances for the shorts model (otherwise the jackets values would be plotted)
extended_importances = sorted(extended_importances, key = lambda i: i['importance'], reverse=True)

extended_importances = pd.DataFrame.from_dict( extended_importances )

fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '[1]' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '[1]' ]['importance'] )
             )

fig.update_layout(
        title='Importances for the 1st day of prediction horizon'
)

fig.show()

fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '[2]' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '[2]' ]['importance'] )
             )

fig.update_layout(
        title='Importances for the 2nd day of prediction horizon'
)

fig.show()

Evaluation of results¶

Results for out-of-sample interval.

In [34]:
fig = go.Figure()

fig.add_trace( go.Scatter(x=evaluation_data.index, y=evaluation_data[ target_column ].values, name='Actual', line_shape='linear'))

fig.add_trace( go.Scatter(x=evaluation_data.index, y=evaluation_data[ target_column+'_pred'].values, name='Predicted', line_shape='linear'))

fig.update_layout( autosize=False, width=1200, height=500 )

fig.show()

Accuracy metrics for the 2nd day.

In [35]:
backtest.aggregated_predictions[1]['accuracyMetricsForHorizons']['S+2']
Out[35]:
{'MAE': 1422.5082361228722,
 'MSE': 3534266.9658089527,
 'MAPE': 342.0953718480556,
 'RMSE': 1879.9646182332667}

Summary¶

We demonstrated how TIM can, with default settings, predict product demand even in times of significant structural change (the Covid-19 lockdowns).

To improve accuracy further, we recommend enhancing the dataset with predictors linked to relevant marketing activities/plans (e.g., a campaign for a specific product in a given area).
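A sketch of how such a predictor could be joined in, assuming a hypothetical campaign calendar with a DATE column and a binary Campaign flag (both the calendar and the values are illustrative, not part of the datasets used above):

```python
import pandas as pd

# Hypothetical campaign calendar: 1 on days a campaign for the product runs
campaigns = pd.DataFrame({
    'DATE': pd.to_datetime(['2020-09-28', '2020-09-29', '2020-09-30']),
    'Campaign': [1, 1, 1],
})

# Toy slice of a modelling dataset (DATE + target), as loaded from CSV
data = pd.DataFrame({
    'DATE': pd.date_range('2020-09-27', periods=5, freq='D'),
    'UNITS': [1500.0, 1610.0, 1720.0, 1579.0, 2284.0],
})

# Left-join onto the modelling dataset and default non-campaign days to 0
enhanced = data.merge(campaigns, on='DATE', how='left')
enhanced['Campaign'] = enhanced['Campaign'].fillna(0).astype(int)
print(enhanced)
```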