Predicting car traffic for on-ramp in LA¶

Title: Predicting car traffic for on-ramp in LA
Author: Michal Bezak - Tangent Works
Industry: Transportation, Smart cities and infrastructure
Area: Traffic management
Type: Forecasting

Description¶

Smart traffic solutions are becoming increasingly important, playing a vital role in making our cities and infrastructure smarter. They comprise multiple parts, spanning hardware, software, and, in recent years, also AI/ML.

With predictions of the utilization (and potential congestion) of particular segments of road infrastructure, it is possible to better optimize the routes taken and thus cut the time necessary to transport goods, people, etc.

The value derived from such a capability can be estimated with proxy indicators such as people's time saved or fuel expenses avoided.

Business parameters¶

Business objective: Cut the time required to deliver goods in a certain area
Business value: Higher utilization of vans (measured by goods delivered in a given time frame); shorter delivery times
KPI: -
In [2]:
import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json
import datetime
import copy
import math

import tim_client

Credentials and logging

(Do not forget to fill in your credentials in the credentials.json file)
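For reference, here is a minimal sketch that writes a placeholder credentials.json; the keys are inferred from the tim_client.Credentials call further below, and all values are placeholders you must replace with your own:

import json

with open('credentials.json', 'w') as f:
    json.dump({
        'license_key': 'YOUR_LICENSE_KEY',              # TIM license key
        'email': 'you@example.com',                     # account e-mail
        'password': 'YOUR_PASSWORD'                     # account password
    }, f, indent=2)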

In [3]:
with open('credentials.json') as f:
    credentials_json = json.load(f)                     # loading the credentials from credentials.json

TIM_URL = 'https://timws.tangent.works/v4/api'          # URL to which the requests are sent

SAVE_JSON = False                                       # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/'                            # folder where the requests and responses are stored

LOGGING_LEVEL = 'INFO'
In [4]:
level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)
In [5]:
credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)

api_client.save_json = SAVE_JSON
api_client.json_saving_folder_path = JSON_SAVING_FOLDER
[INFO] 2021-01-15 11:56:53,486 - tim_client.api_client:save_json:66 - Saving JSONs functionality has been disabled
[INFO] 2021-01-15 11:56:53,489 - tim_client.api_client:json_saving_folder_path:75 - JSON destination folder changed to logs

Dataset¶

The dataset is a combination of two original datasets:

  • volumes (sensor data),
  • event times of games played at the nearby stadium.

It was also enhanced with holiday information relevant for the given timestamps.

Loop sensor data were collected for the Glendale on-ramp of the 101 North freeway in Los Angeles, US. The ramp is close enough to the stadium to register unusual traffic after a Dodgers game, but not so close that it is dominated by game traffic.

Sampling and gaps¶

Data are sampled at 5-minute intervals and may contain gaps (missing timestamps).
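You can verify this quickly with pandas; a minimal sketch, assuming the data.csv file used later in this notebook:

import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['TS'])
deltas = df['TS'].diff().dropna()                  # time between consecutive samples
gaps = deltas[deltas > pd.Timedelta(minutes=5)]    # anything longer than 5 minutes is a gap
print(f'{len(gaps)} gaps found, longest: {gaps.max() if len(gaps) else None}')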

Data¶

Column name | Description | Type | Availability
----------- | ----------- | ---- | ------------
TS | Timestamp | Timestamp column | -
Volume | Number of cars counted during the previous five minutes | Target | t-1
Event | Binary flag indicating whether a game was played at the nearby stadium | Predictor | t+6
Holiday | Binary flag indicating a public holiday | Predictor | t+6

(Availability t+6 means the predictor's values are known for the whole 30-minute horizon; t-1 means only past values are available.)

Forecasting situations¶

Our goal is to predict the next 30 minutes of traffic.

The CSV file used in this experiment can be downloaded here.

Source¶

Data were published at the UCI Machine Learning Repository. Loop sensor measurements were obtained from the Freeway Performance Measurement System (PeMS).

In [10]:
data = tim_client.load_dataset_from_csv_file('data.csv', sep=',')

We can see that the last 6 data points are NaN, i.e. missing from the dataset, because we want to back-test predictions of the next 30 minutes (6 × 5 min = 30 min).
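If you prepare such a file yourself, the empty horizon can be appended as follows (a minimal sketch; the sample data.csv already contains these rows, so do not run it on this dataset):

import numpy as np
import pandas as pd

data['TS'] = pd.to_datetime(data['TS'])             # ensure datetime type
horizon = pd.date_range(data['TS'].iloc[-1] + pd.Timedelta(minutes=5), periods=6, freq='5min')
future = pd.DataFrame({'TS': horizon, 'Volume': np.nan,    # target left empty for TIM to predict
                       'Event': 0, 'Holiday': 0})          # predictor values must cover the horizon
data = pd.concat([data, future], ignore_index=True)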

In [11]:
data.tail(7)
Out[11]:
TS Volume Event Holiday
49728 2005-09-30 23:35:00 15.0 0 0
49729 2005-09-30 23:40:00 NaN 0 0
49730 2005-09-30 23:45:00 NaN 0 0
49731 2005-09-30 23:50:00 NaN 0 0
49732 2005-09-30 23:55:00 NaN 0 0
49733 2005-10-01 00:00:00 NaN 0 0
49734 2005-10-01 00:05:00 NaN 0 0
In [24]:
data.shape
Out[24]:
(49735, 4)

Zoom in closer to see the event lines (red) in the chart below.

In [14]:
data_for_chart_event = data['Event'].apply( lambda x: None if x==0 else x )*40    # scale the binary event flag to 40 so it is visible next to Volume
In [15]:
fig = plt.subplots.make_subplots(rows=1, cols=1, shared_xaxes=True, vertical_spacing=0.02)  

fig.add_trace(go.Scatter(x = data.loc[:, "TS"], y=data.loc[:, "Volume"], name = "Volume", line=dict(color='blue')), row=1, col=1) 

fig.add_trace(go.Scatter(x = data.loc[:, "TS"], y=data_for_chart_event, name = "Event",  line=dict(color='red')), row=1, col=1) 

fig.update_layout(height=500, width=1000)                           

fig.show()

Engine settings¶

The only parameters that need to be set are:

  • Prediction horizon (Sample + 6) - because we want to predict the next 30 minutes,
  • Interpolation (data imputation) - recommended because of the gaps (missing timestamps) present in the data,
  • Back-test length.

We also ask the engine for additional data so that we can see the details of the sub-models, so we define the extendedOutputConfiguration parameter as well.

30% of the data will be used for the out-of-sample interval.

In [16]:
backtest_length = int( data.shape[0] * .3 )

backtest_length
Out[16]:
14920
In [17]:
engine_settings = {
    'usage': {                                 
        'predictionTo': { 
            'baseUnit': 'Sample',              
            'offset': 6                        # predict next 30 minutes
        },
        'backtestLength': backtest_length     
    },
    'extendedOutputConfiguration': {
        'returnExtendedImportances': True      # flag that specifies if the importances of features are returned in the response
    },
    'interpolation': {
        'maxLength': 24*3,                 # interpolate gaps of up to 72 samples (6 hours at 5-minute sampling)
        'type': 'Linear'
    }
}

Experiment iteration(s)¶

We will run only one iteration of the experiment, using default settings.

In [18]:
backtest = api_client.prediction_build_model_predict(data, engine_settings)

backtest.status     
Out[18]:
'FinishedWithWarning'
In [19]:
backtest.result_explanations
Out[19]:
[{'index': 1,
  'message': 'Predictor Volume has a value missing for timestamp 2005-04-15 10:20:00.'}]
In [30]:
out_of_sample_timestamps = backtest.aggregated_predictions[1]['values'].index.tolist()

out_of_sample_predictions = pd.DataFrame.from_dict( backtest.aggregated_predictions[1]['values'] )
In [31]:
out_of_sample_predictions['Volume_pred'] = out_of_sample_predictions['Prediction']

out_of_sample_predictions = out_of_sample_predictions[ ['Volume_pred'] ]
In [32]:
evaluation_data = data.copy()

evaluation_data['TS'] = pd.to_datetime(data['TS']).dt.tz_localize('UTC')
evaluation_data = evaluation_data[ evaluation_data['TS'].isin( out_of_sample_timestamps ) ]

evaluation_data.set_index('TS',inplace=True)

evaluation_data = evaluation_data[ ['Volume'] ]
In [33]:
evaluation_data = evaluation_data.join( out_of_sample_predictions )
In [34]:
evaluation_data.head()
Out[34]:
Volume Volume_pred
TS
2005-08-10 04:20:00+00:00 3.0 1.81030
2005-08-10 04:25:00+00:00 2.0 1.97560
2005-08-10 04:30:00+00:00 5.0 2.69418
2005-08-10 04:35:00+00:00 6.0 3.32686
2005-08-10 04:40:00+00:00 5.0 4.46732

Inspecting ML models internals¶

TIM offers a view of which predictors it considers important. Simple and extended importances are available, showing to what extent each predictor contributes to explaining the variance of the target variable. Extended importances break this down per model (one sub-model is built for each point in the prediction horizon); term names such as Volume(t-2) denote values lagged by the given number of samples.

In [35]:
simple_importances = backtest.predictors_importances['simpleImportances']
simple_importances = sorted(simple_importances, key = lambda i: i['importance'], reverse=True) 

simple_importances = pd.DataFrame.from_dict( simple_importances )

simple_importances
Out[35]:
importance predictorName
0 83.07 Volume
1 11.20 Event
2 5.73 Holiday
In [36]:
fig = go.Figure()

fig.add_trace(go.Bar( x = simple_importances['predictorName'],
                      y = simple_importances['importance'] )
             )

fig.update_layout(
        title='Simple importances'
)

fig.show()
In [37]:
extended_importances = backtest.predictors_importances['extendedImportances']
extended_importances = sorted(extended_importances, key = lambda i: i['importance'], reverse=True) 

extended_importances = pd.DataFrame.from_dict( extended_importances )

extended_importances
Out[37]:
time type termName importance
0 [2] TargetAndTargetTransformation Volume(t-2) 29.01
1 [3] TargetAndTargetTransformation Volume(t-3) 28.65
2 [6] TargetAndTargetTransformation Volume(t-6) 28.47
3 [4] TargetAndTargetTransformation Volume(t-4) 28.17
4 [1] TargetAndTargetTransformation Volume(t-1) 27.66
... ... ... ... ...
115 [1] Interaction DoW(t-36) ≤ Sat & Volume(t-5) 1.70
116 [2] Calendar DoW(t-276) ≤ Thu 1.65
117 [2] TargetAndTargetTransformation Volume(t-5) 1.64
118 [1] Calendar DoW(t-276) ≤ Thu 1.54
119 [1] Interaction sin(2πt / 4.0 hours) & DoW(t-48) ≤ Fri 1.52

120 rows × 4 columns

In [38]:
fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '[1]' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '[1]' ]['importance'] )
             )

fig.update_layout(
        title='Importances for the model used for predictions of the 1st point in prediction horizon',
        height = 700,
        width = 1000
)

fig.show()

fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '[6]' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '[6]' ]['importance'] )
             )

fig.update_layout(
        title='Importances for the model used for predictions of the 6th point in prediction horizon',
        height = 700,
        width = 1000
)

fig.show()

Evaluation of results¶

Results for the out-of-sample interval.
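We evaluate with the mean absolute error and the root mean squared error, computed in the cells below:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual Volume and $\hat{y}_i$ the predicted value.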

In [40]:
evaluation_data['err'] = evaluation_data['Volume'] - evaluation_data['Volume_pred'] 
evaluation_data['err_squared'] = evaluation_data['err']**2
In [41]:
MAE = abs(evaluation_data['err']).mean()
RMSE = math.sqrt( evaluation_data['err_squared'].mean() )
In [42]:
MAE
Out[42]:
4.332961734396008
In [43]:
RMSE
Out[43]:
5.878749520817067
In [44]:
fig = plt.subplots.make_subplots(rows=1, cols=1, shared_xaxes=True, vertical_spacing=0.02)  

fig.add_trace(go.Scatter(x = evaluation_data.index, y=evaluation_data.loc[:, "Volume"], name = "Actual", line=dict(color='blue')), row=1, col=1) 

fig.add_trace(go.Scatter(x = evaluation_data.index, y=evaluation_data.loc[:, "Volume_pred"], name = "Predicted",  line=dict(color='red')), row=1, col=1) 

fig.update_layout(height=500, width=1000)                           

fig.show()

Summary¶

We demonstrated that, using only the historical values of the target, an indicator of events at the nearby public venue, and holiday information, TIM can automatically build quite precise forecasting models with default settings. To improve accuracy further, we recommend enhancing the dataset with meteo predictors (such as temperature, rain, wind, etc.) and with measurements from other adjacent points of the road infrastructure.
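Such an enhancement could look as follows (a hypothetical sketch; weather.csv with a TS column and meteo readings is an assumed file, not part of this example):

import pandas as pd

data = pd.read_csv('data.csv', parse_dates=['TS'])
weather = pd.read_csv('weather.csv', parse_dates=['TS'])   # hypothetical file: TS, Temperature, Rain, Wind
enhanced = data.merge(weather, on='TS', how='left')        # left join keeps the original 5-minute grid
enhanced.to_csv('data_enhanced.csv', index=False)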