Metro traffic forecasting¶

Title: Metro traffic forecasting
Author: Michal Bezak, Tangent Works
Industry: Transportation
Area: Public transport
Type: Forecasting

Description¶

Metro is one of the most popular and useful means of public transport across the globe. It cuts travelling time for millions of people every day, and so its availability (at the right capacity and time) is critical.

Operating a metro requires a precise management system. Accurate forecasts of the volume of passengers travelling on concrete lines on a certain day (and time) support decisions about timely and right-sized dispatch of resources - having the right number of cars prepared, with the right personnel.

Business parameters¶

Business objective: Optimal resource allocation
Business value: Customer (traveler) experience, Service availability, Employee experience
KPI: -
In [1]:
import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json
import datetime

from sklearn.metrics import mean_absolute_error, mean_squared_error
from math import sqrt

import tim_client

Credentials and logging

(Do not forget to fill in your credentials in the credentials.json file)

In [2]:
with open('credentials.json') as f:
    credentials_json = json.load(f)                     # loading the credentials from credentials.json

TIM_URL = 'https://timws.tangent.works/v4/api'          # URL to which the requests are sent

SAVE_JSON = False                                       # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/'                            # folder where the requests and responses are stored

LOGGING_LEVEL = 'INFO'
In [3]:
level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)
In [4]:
credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)

api_client.save_json = SAVE_JSON
api_client.json_saving_folder_path = JSON_SAVING_FOLDER
[INFO] 2021-01-15 10:59:25,533 - tim_client.api_client:save_json:66 - Saving JSONs functionality has been disabled
[INFO] 2021-01-15 10:59:25,535 - tim_client.api_client:json_saving_folder_path:75 - JSON destination folder changed to logs

Dataset¶

The data contain hourly traffic volume for Interstate 94 Westbound, ATR station 301, in Minnesota, US. They are enhanced with holiday and weather predictors.

Sampling and gaps¶

Data are sampled on an hourly basis and can contain occasional gaps.
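Such gaps can be located quickly with a pandas date-range difference; a minimal sketch on a toy frame (column names mirror the dataset, the values are illustrative only):

```python
import pandas as pd

# Toy frame with one missing hour (02:00) in an hourly-sampled series
df = pd.DataFrame({
    'date_time': pd.to_datetime([
        '2016-01-01 00:00:00',
        '2016-01-01 01:00:00',
        '2016-01-01 03:00:00',   # 02:00 is missing
    ]),
    'traffic_volume': [1513.0, 1550.0, 719.0],
})

# Build the complete hourly index and subtract the timestamps we have
full_range = pd.date_range(df['date_time'].min(), df['date_time'].max(), freq='h')
missing = full_range.difference(df['date_time'])
print(missing)   # DatetimeIndex(['2016-01-01 02:00:00'], ...)
```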

Data¶

| Column name | Description | Type | Availability |
| --- | --- | --- | --- |
| date_time | Hour of the data collected, in local CST time | Timestamp column | - |
| traffic_volume | I-94 ATR 301 reported westbound traffic volume | Target | t-1 |
| holiday | US national holidays plus regional holidays, incl. Minnesota State Fair (0 or 1) | Predictor | t+34 |
| temp | Average temperature in kelvin | Predictor | t+34 |
| rain_1h | mm of rain that occurred in the hour | Predictor | t+34 |
| snow_1h | mm of snow that occurred in the hour | Predictor | t+34 |
| clouds_all | Percentage of cloud cover | Predictor | t+34 |
| Clear | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Clouds | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Drizzle | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Fog | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Haze | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Mist | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Rain | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Smoke | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Snow | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Squall | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |
| Thunderstorm | Weather descriptor stored as binary value (0 or 1) | Predictor | t+34 |

For predictors with availability of t+N where N>0, it is assumed that forecast values are available.
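This availability translates directly into the shape of the dataset: the target column ends with NaNs, while predictor columns are filled all the way to the forecast horizon. An illustrative sketch on toy data (not the actual file):

```python
import numpy as np
import pandas as pd

# Toy frame: target available up to t-1, predictors up to t+34,
# so the last 34 rows have a NaN target but filled predictor values.
idx = pd.date_range('2018-09-28 00:00', periods=72, freq='h')
df = pd.DataFrame({
    'date_time': idx,
    'traffic_volume': 4000.0,   # placeholder target values
    'temp': 280.0,              # placeholder predictor values
})
df.loc[df.index[-34:], 'traffic_volume'] = np.nan   # last known target is at t-1

print(df['traffic_volume'].isna().sum(), df['temp'].isna().sum())   # 34 0
```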

Forecasting situation¶

TIM detects the forecasting situation from the current "shape" of the data, e.g. if the last target value is available at the 13:00 timestamp, it will start forecasting as of 14:00. It also takes that last 13:00 timestamp as a reference point against which the availability of each column in the dataset is determined - this rule is then followed during back-testing when calculating results for the out-of-sample interval.

In our case all predictor values were available 34 steps ahead.

We want to back-test the forecasting situation at 14:00, predicting values for the rest of the day and the whole next day. At this hour, the last available target value is for 13:00, and we will predict values until the end of the next day (23:00).
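For experimentation, such a "situation" dataset can be derived from a complete history by blanking the target after the cutoff timestamp while keeping the predictors. A sketch under the assumption of the column names above (`make_situation` is a hypothetical helper, not part of tim_client):

```python
import numpy as np
import pandas as pd

def make_situation(df: pd.DataFrame, last_target_ts: str,
                   target_col: str = 'traffic_volume') -> pd.DataFrame:
    """Blank out the target after the last known timestamp, keeping predictors."""
    df = df.copy()
    mask = pd.to_datetime(df['date_time']) > pd.Timestamp(last_target_ts)
    df.loc[mask, target_col] = np.nan
    return df

# Toy frame: one hourly row before and one after the cutoff
toy = pd.DataFrame({
    'date_time': ['2018-09-29 13:00:00', '2018-09-29 14:00:00'],
    'traffic_volume': [4553.0, 4600.0],
    'temp': [279.66, 280.31],
})
situation = make_situation(toy, '2018-09-29 13:00:00')
print(situation['traffic_volume'].tolist())   # [4553.0, nan]
```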

CSV file used in experiments can be downloaded here.

Source¶

The dataset was adapted from the Metro Interstate Traffic Volume Data Set shared at the UCI Machine Learning Repository.

Traffic data was provided by the Minnesota Department of Transportation.

Weather data was provided by OpenWeatherMap.

Read the dataset for the given situation.

In [7]:
data = tim_client.load_dataset_from_csv_file('metro_traffic_situation_at_14_2.csv', sep=',')
In [8]:
data.head()
Out[8]:
date_time traffic_volume holiday temp rain_1h snow_1h clouds_all Clear Clouds Drizzle Fog Haze Mist Rain Smoke Snow Squall Thunderstorm
0 2016-01-01 00:00:00 1513.0 1 265.94 0.0 0.0 90 0 0 0 0 1 0 0 0 0 0 0
1 2016-01-01 01:00:00 1550.0 1 266.00 0.0 0.0 90 0 0 0 0 0 0 0 0 1 0 0
2 2016-01-01 03:00:00 719.0 1 266.01 0.0 0.0 90 0 0 0 0 0 0 0 0 1 0 0
3 2016-01-01 04:00:00 533.0 1 264.80 0.0 0.0 90 0 1 0 0 0 0 0 0 0 0 0
4 2016-01-01 05:00:00 586.0 1 264.38 0.0 0.0 90 0 1 0 0 0 0 0 0 0 0 0

The NaN values at the end of the target (traffic_volume) column depict the reality of the data at the time of forecasting, i.e. at 14:00, with the last available data point at 13:00, we want to predict volume until the end of the next day.
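A quick way to verify the length of that trailing NaN block, shown here on a toy series (`trailing_nan_count` is a hypothetical helper):

```python
import numpy as np
import pandas as pd

def trailing_nan_count(s: pd.Series) -> int:
    """Number of consecutive NaNs at the end of the series."""
    non_nan = s.notna()
    if non_nan.all():
        return 0
    last_valid = non_nan[non_nan].index[-1]          # label of last non-NaN value
    return len(s) - s.index.get_loc(last_valid) - 1  # rows after that label

# Toy series mirroring the dataset tail: one known value, then 34 NaNs
s = pd.Series([4553.0] + [np.nan] * 34)
print(trailing_nan_count(s))   # 34
```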

In [9]:
data.tail(35)
Out[9]:
date_time traffic_volume holiday temp rain_1h snow_1h clouds_all Clear Clouds Drizzle Fog Haze Mist Rain Smoke Snow Squall Thunderstorm
23049 2018-09-29 13:00:00 4553.0 0 279.66 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23050 2018-09-29 14:00:00 NaN 0 280.31 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23051 2018-09-29 15:00:00 NaN 0 280.99 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23052 2018-09-29 16:00:00 NaN 0 281.41 0.00 0.0 90 0 0 1 0 0 0 0 0 0 0 0
23053 2018-09-29 17:00:00 NaN 0 281.44 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23054 2018-09-29 18:00:00 NaN 0 281.02 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23055 2018-09-29 19:00:00 NaN 0 280.68 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23056 2018-09-29 20:00:00 NaN 0 280.55 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23057 2018-09-29 21:00:00 NaN 0 280.40 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23058 2018-09-29 22:00:00 NaN 0 280.54 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23059 2018-09-29 23:00:00 NaN 0 280.32 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23060 2018-09-30 00:00:00 NaN 0 280.30 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23061 2018-09-30 01:00:00 NaN 0 280.19 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23062 2018-09-30 02:00:00 NaN 0 280.07 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23063 2018-09-30 03:00:00 NaN 0 280.08 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23064 2018-09-30 04:00:00 NaN 0 279.88 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23065 2018-09-30 05:00:00 NaN 0 279.96 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23066 2018-09-30 06:00:00 NaN 0 280.17 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23067 2018-09-30 07:00:00 NaN 0 280.16 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23068 2018-09-30 08:00:00 NaN 0 280.28 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23069 2018-09-30 09:00:00 NaN 0 280.62 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23070 2018-09-30 10:00:00 NaN 0 281.38 0.00 0.0 75 0 1 0 0 0 0 0 0 0 0 0
23071 2018-09-30 11:00:00 NaN 0 282.18 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23072 2018-09-30 12:00:00 NaN 0 282.69 0.00 0.0 75 0 1 0 0 0 0 0 0 0 0 0
23073 2018-09-30 13:00:00 NaN 0 283.03 0.00 0.0 90 0 0 0 0 0 0 1 0 0 0 0
23074 2018-09-30 14:00:00 NaN 0 283.48 0.00 0.0 90 0 0 0 0 0 0 1 0 0 0 0
23075 2018-09-30 15:00:00 NaN 0 283.84 0.00 0.0 75 0 0 0 0 0 0 1 0 0 0 0
23076 2018-09-30 16:00:00 NaN 0 284.38 0.00 0.0 75 0 0 0 0 0 0 1 0 0 0 0
23077 2018-09-30 17:00:00 NaN 0 284.79 0.00 0.0 75 0 1 0 0 0 0 0 0 0 0 0
23078 2018-09-30 18:00:00 NaN 0 284.20 0.25 0.0 75 0 0 0 0 0 0 1 0 0 0 0
23079 2018-09-30 19:00:00 NaN 0 283.45 0.00 0.0 75 0 1 0 0 0 0 0 0 0 0 0
23080 2018-09-30 20:00:00 NaN 0 282.76 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23081 2018-09-30 21:00:00 NaN 0 282.73 0.00 0.0 90 0 0 0 0 0 0 0 0 0 0 1
23082 2018-09-30 22:00:00 NaN 0 282.09 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0
23083 2018-09-30 23:00:00 NaN 0 282.12 0.00 0.0 90 0 1 0 0 0 0 0 0 0 0 0

Visualisation of the data.

In [10]:
fig = plt.subplots.make_subplots(rows=1, cols=1, shared_xaxes=True, vertical_spacing=0.02)  

fig.add_trace( go.Scatter( x = data.loc[:, "date_time"], y=data.loc[:, "traffic_volume"], name = "traffic_volume", line=dict(color='blue')), row=1, col=1) 

fig.update_layout(height=500, width=1000, title = 'Traffic volume')                           

fig.show()

Engine settings¶

Parameters that need to be set are:

  • Prediction horizon (baseUnit 'Sample', offset 34) - because we want to predict values until the end of the next day.
  • Back-test length.

We also ask the engine for additional outputs to see details of the sub-models, so we define the extendedOutputConfiguration parameter as well.

30% of the data will be used for the out-of-sample interval.

In [11]:
backtest_length = int( data.shape[0] * .3 )

backtest_length
Out[11]:
6925
In [12]:
configuration_backtest = {
    'usage': {                                 
        'predictionTo': { 
            'baseUnit': 'Sample',             
            'offset': 34                        # number of units to predict into the future (34 samples here - until the end of the next day)
        },
        'backtestLength': backtest_length       # number of samples that are used for backtesting (note that these samples are excluded from model building period)
    },
    'extendedOutputConfiguration': {
        'returnExtendedImportances': True       # flag that specifies if the importances of features are returned in the response
    }
}

Experiment iteration(s)¶

There is only one forecasting situation, and we want to see how the default settings perform, so we will run only one iteration.

In [13]:
backtest = api_client.prediction_build_model_predict(data, configuration_backtest)     # running the RTInstantML forecasting using data and defined configuration
backtest.status                                                                        # status of the job
Out[13]:
'FinishedWithWarning'
In [14]:
backtest.result_explanations
Out[14]:
[{'index': 1,
  'message': 'Predictor traffic_volume has a value missing for timestamp 2016-01-01 02:00:00.'},
 {'index': 2,
  'message': 'Predictor rain_1h contains an outlier or a structural change in its most recent records.'}]
In [15]:
out_of_sample_predictions = backtest.aggregated_predictions[3]['values']  # index 3 points to the out-of-sample interval with next-day values only
In [16]:
out_of_sample_predictions.rename( columns = {'Prediction':'traffic_volume_pred'}, inplace=True)
In [17]:
out_of_sample_timestamps = out_of_sample_predictions.index.tolist()
In [18]:
evaluation_data = data.copy()

evaluation_data['date_time'] = pd.to_datetime(data['date_time']).dt.tz_localize('UTC')
evaluation_data = evaluation_data[ evaluation_data['date_time'].isin( out_of_sample_timestamps ) ]

evaluation_data.set_index('date_time',inplace=True)

evaluation_data = evaluation_data[ ['traffic_volume'] ]
In [19]:
evaluation_data = evaluation_data.join( out_of_sample_predictions )
evaluation_data.head()
Out[19]:
traffic_volume traffic_volume_pred
date_time
2017-12-15 01:00:00+00:00 490.0 491.966
2017-12-15 02:00:00+00:00 358.0 385.901
2017-12-15 03:00:00+00:00 387.0 379.254
2017-12-15 04:00:00+00:00 836.0 720.545
2017-12-15 05:00:00+00:00 2773.0 2540.960

Insights - inspecting ML models¶

Simple and extended importances are available, so you can see to what extent each predictor contributes to explaining the variance of the target variable.

In [20]:
simple_importances = backtest.predictors_importances['simpleImportances']
simple_importances = sorted(simple_importances, key = lambda i: i['importance'], reverse=True) 

simple_importances = pd.DataFrame.from_dict( simple_importances )

simple_importances
Out[20]:
importance predictorName
0 88.94 traffic_volume
1 6.13 holiday
2 2.12 clouds_all
3 1.55 Snow
4 0.80 Clouds
5 0.38 Clear
6 0.07 Rain
In [21]:
fig = go.Figure()

fig.add_trace(go.Bar( x = simple_importances['predictorName'],
                      y = simple_importances['importance'] )
             )

fig.update_layout(
        title='Simple importances'
)

fig.show()
In [22]:
extended_importances = backtest.predictors_importances['extendedImportances']
extended_importances = sorted(extended_importances, key = lambda i: i['importance'], reverse=True) 

extended_importances = pd.DataFrame.from_dict( extended_importances )
In [23]:
fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '11:00:00' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '11:00:00' ]['importance'] )
             )

fig.update_layout(
        title='Predictor importances for the model used for predictions of values for 11:00',
        height = 700
)

fig.show()

fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '02:00:00' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '02:00:00' ]['importance'] )
             )

fig.update_layout(
        title='Predictor importances for the model used for predictions of values for 02:00',
        height = 700
)

fig.show()

Evaluation of results¶

Results for out-of-sample interval.

In [24]:
fig = go.Figure()

fig.add_trace( go.Scatter(x=evaluation_data.index, y=evaluation_data['traffic_volume'].values, name='Actual', line_shape='linear'))

fig.add_trace( go.Scatter(x=evaluation_data.index, y=evaluation_data['traffic_volume_pred'].values, name='Predicted', line_shape='linear'))

fig.update_layout( autosize=False, width=1200, height=500 )

fig.show()
In [25]:
MAE = mean_absolute_error( evaluation_data['traffic_volume'].values , evaluation_data['traffic_volume_pred'].values )
MAE
Out[25]:
228.01552092686455
In [26]:
RMSE = sqrt( mean_squared_error( evaluation_data['traffic_volume'].values , evaluation_data['traffic_volume_pred'].values ) )
RMSE
Out[26]:
378.52242798669397
In [29]:
evaluation_data['err_pct'] = ( evaluation_data['traffic_volume'] - evaluation_data['traffic_volume_pred'] ) / evaluation_data['traffic_volume']
In [30]:
MAPE = abs(evaluation_data['err_pct']).mean()
MAPE
Out[30]:
0.10241788405181448

Summary¶

We can see that TIM, with default settings and a simple dataset, achieved a MAPE of 10.24% for day-ahead prediction.

Imagine enhancing the data with additional information relevant to a given line/region/infrastructure, such as planned events in venues surrounding metro stations, volumes on connecting means of transport, etc.