Configuration
We have put a lot of effort into creating a fully automatic model building engine. Despite these efforts, some models do not reach the highest possible accuracy. In such cases, users can adjust the algorithm's exposed parameters so that even the toughest dataset can be modeled properly.
The following subsections go through all of the available settings of TIM Forecasting. The table below shows all configuration parameters available for different job types.
Overall configuration
Configuration parameter | build-model | rebuild-model | predict | default |
---|---|---|---|---|
Prediction to | ☑ | ☑ | ☑ | Sample + 1 |
Prediction from | ☑ | ☑ | ☑ | Sample + 1 |
Model quality | ☑ | ☑ | ☐ | Combined (Very High for D+0 and D+1, High otherwise) |
Normalization | ☑ | ☑ | ☐ | true |
Model complexity | ☑ | ☑ | ☐ | automatic |
Features | ☑ | ☑ | ☐ | Polynomial, Time offsets, Identity, Intercept, Rest of week, Piecewise linear, Exponential moving average, Periodic |
Daily cycle | ☑ | ☐ | ☐ | automatic |
Allow offsets | ☑ | ☑ | ☐ | true |
Offset limit | ☑ | ☑ | ☐ | automatic |
Memory limit check | ☑ | ☑ | ☐ | true |
Rebuilding policy | ☐ | ☑ | ☐ | All |
Prediction intervals | ☑ | ☐ | ☐ | 90% |
Prediction boundaries | ☑ | ☑ | ☑ | automatic |
Rolling window | ☑ | ☑ | ☑ | 1 day (daily cycle) / Prediction to (nondaily cycle) |
Backtest | ☑ | ☑ | ☐ | All |
☑ available in a given method
☐ not available in a given method
Prediction to
This setting serves to define the forecasting horizon. It consists of a baseUnit (one of Day, Hour, Minute, Second and Sample) and a value (non-negative integer). If not set, TIM will default to one Sample ahead.
"predictionTo": {
"baseUnit": "Day",
"value": 2
}
Defining PredictionTo with Samples
This is the easiest way to define the forecasting horizon. TIM will try to forecast as many samples as given by value, starting from the last target observation in the dataset and using a step size equal to the sampling period estimated from the dataset (or stored in the model).
Defining PredictionTo with Day, Hour, Minute and Second
Often, a user wishes to forecast the entire following day but does not want to count how many samples this represents (the number changes depending on where the last target observation currently is). This notation works relative to the last target observation. Suppose the user sets "predictionTo" to Day+1. In that case, TIM will forecast up to the last observation of the following day, regardless of where within the current day the target currently ends (parts of the datetime of the target end that are measured in a smaller granularity than baseUnit are ignored). This logic works similarly for baseUnit Hour and QuarterHour - see the table below with examples.
PredictionTo | Last target observation | Denotes all samples up until |
---|---|---|
D+1 | 28-01-2012 22:13:56 | 29-01-2012 23:59:59 |
D+0 | 28-01-2012 22:13:56 | 28-01-2012 23:59:59 |
H+1 | 28-01-2012 22:13:56 | 28-01-2012 23:59:59 |
H+0 | 28-01-2012 22:13:56 | 28-01-2012 22:59:59 |
Q+1 | 28-01-2012 22:13:56 | 28-01-2012 22:29:59 |
Q+0 | 28-01-2012 22:13:56 | 28-01-2012 22:14:59 |
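The rows of the table can be reproduced with a small calculation. The following is a minimal sketch of the rule described above, not TIM's implementation; the function name horizon_end and the unit name QuarterHour (taken from the Q rows of the table) are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Length of each base unit in seconds (illustrative subset of units).
UNIT_SECONDS = {"Day": 86400, "Hour": 3600, "QuarterHour": 900}


def horizon_end(last_target, base_unit, value):
    """Return the last second covered by a relative predictionTo setting."""
    step = UNIT_SECONDS[base_unit]
    midnight = last_target.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((last_target - midnight).total_seconds())
    # Ignore everything measured in a smaller granularity than the base unit.
    unit_start = midnight + timedelta(seconds=(elapsed // step) * step)
    # Move to the final second of the unit that lies `value` units ahead.
    return unit_start + timedelta(seconds=(value + 1) * step - 1)


last = datetime(2012, 1, 28, 22, 13, 56)
print(horizon_end(last, "Day", 1))  # 2012-01-29 23:59:59
```

Calling the function with ("Hour", 0) or ("QuarterHour", 1) gives the corresponding rows of the table.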
Prediction from
This setting complements 'predictionTo' and allows skipping the first samples in the forecasting horizon. If not set, TIM will default to one Sample ahead, not skipping anything.
"predictionFrom": {
"baseUnit": "Sample",
"value": 3
}
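When both settings use the base unit Sample, their interplay can be pictured as a range of sample steps. The sketch below is only an illustration under the assumption that step 1 is the first sample after the last target observation; the function name forecast_steps is hypothetical and not part of TIM.

```python
def forecast_steps(prediction_from=1, prediction_to=1):
    """Return the sample steps that are forecast.

    Step 1 is assumed to be the first sample after the last target
    observation, so the defaults (1 and 1) forecast exactly one sample.
    """
    return list(range(prediction_from, prediction_to + 1))


# predictionFrom = Sample + 3 and predictionTo = Sample + 5:
# the first two samples of the horizon are skipped.
print(forecast_steps(3, 5))  # [3, 4, 5]
```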
Model quality
This setting is deprecated and was replaced by target offsets and predictor offsets. It controls the model complexity versus training time tradeoff. The higher the model quality, the longer it takes to build the Model Zoo. If not set, Combined will be used. Options are:
- Low:
- dummy quality, these models can be used even without any data provided,
- is replaced by setting target offsets to None and selecting only target column under columns in the data configuration and leaving the predictor offsets unset or set to Common,
- Medium:
- models without offsets of the target variable,
- is replaced by setting target offsets to None and leaving the predictor offsets unset or set to Common,
- High:
- models using only a limited amount of offsets of the target variable,
- is replaced by setting target offsets to Common and leaving the predictor offsets unset or set to Common,
- VeryHigh:
- every model uses the closest target offset possible,
- is replaced by setting target offsets to Close and leaving the predictor offsets unset or set to Common,
- Combined:
- VeryHigh quality for intra-day and day-ahead forecasts, High quality for further forecasting horizons,
- is replaced by setting target offsets to Combined and leaving the predictor offsets unset or set to Common, and
- UltraHigh:
- every model uses the closest offset possible for every single predictor,
- is replaced by setting predictor offsets to Close and leaving target offsets unset or set to Close.
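The replacements listed above can be collected in a lookup table when migrating an existing configuration. This is a minimal sketch, not an official migration tool; the function name migrate_model_quality is hypothetical, and where the list allows a setting to be left unset, the sketch sets it explicitly.

```python
# Replacement settings for each deprecated modelQuality value, as listed above.
# Note: "Low" additionally requires selecting only the target column under
# columns in the data configuration, which this sketch does not do.
QUALITY_REPLACEMENTS = {
    "Low": {"targetOffsets": "None", "predictorOffsets": "Common"},
    "Medium": {"targetOffsets": "None", "predictorOffsets": "Common"},
    "High": {"targetOffsets": "Common", "predictorOffsets": "Common"},
    "VeryHigh": {"targetOffsets": "Close", "predictorOffsets": "Common"},
    "Combined": {"targetOffsets": "Combined", "predictorOffsets": "Common"},
    "UltraHigh": {"targetOffsets": "Close", "predictorOffsets": "Close"},
}


def migrate_model_quality(config):
    """Return a copy of config with modelQuality replaced by offset settings."""
    migrated = dict(config)
    if "modelQuality" not in migrated:
        return migrated
    quality = migrated.pop("modelQuality")
    migrated.update(QUALITY_REPLACEMENTS[quality])
    return migrated


print(migrate_model_quality({"modelQuality": "High"}))
```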
Note: For the qualities Medium, High and VeryHigh, a selection of the offsets within a day is optimized to minimize training time. This may cause scenarios where two identical situations within two different prediction horizons can have slightly different models; e.g. models for S+1 may be different if the prediction horizon is in one case set to S+5 and in the other case to S+10.
"modelQuality": "High"
Target offsets
This setting controls offsets of the target variable used in the model building process. Options are:
- None: models without offsets of the target variable,
- Common: models for situations within one day using only common target offsets,
- Close: every model uses the closest target offset possible, and
- Combined: Close offsets for the first two days from the last target timestamp and Common offsets for the rest of the forecasting horizon.
The more specific the offsets used for individual situations, the longer the training takes. For longer time horizons, using the closest possible target offsets stops improving accuracy and unnecessarily slows down model building. If the forecasting horizon reaches too far ahead, the target offsets may not be used at all. Sometimes the use of target offsets is dictated by the use case; e.g., soft sensors simulate the target variable only from the predictors.
This setting is linked to other settings. If predictor offsets are set to Close, then the target offsets can be only None or Close. If allow offsets is set to false, then it can be only None.
The default setting is determined by TIM in the most appropriate way. If allow offsets is set to false, then the default is None. If predictor offsets are set to Close, then the default is Close. It is Combined in all other cases.
Note: For the Common, Close and Combined target offsets, a selection of the offsets within a day is optimized to minimize training time. This may cause scenarios where two identical situations within two different prediction horizons can have slightly different models, e.g. models for S+1 may be different if the prediction horizon is in one case set to S+5 and in the other case to S+10.
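The links between target offsets, predictor offsets and allow offsets described above can be written down as two small helper functions. This is a sketch of the stated rules only; the function names are hypothetical, and the case where allow offsets is false while predictor offsets are Close is resolved here by letting allow offsets take precedence, which is an assumption.

```python
ALL_TARGET_OFFSETS = ("None", "Common", "Close", "Combined")


def allowed_target_offsets(allow_offsets=True, predictor_offsets="Common"):
    """Return the target offsets values permitted by the linked settings."""
    if not allow_offsets:
        return ("None",)
    if predictor_offsets == "Close":
        return ("None", "Close")
    return ALL_TARGET_OFFSETS


def default_target_offsets(allow_offsets=True, predictor_offsets="Common"):
    """Return the default target offsets for the given linked settings."""
    if not allow_offsets:
        return "None"
    if predictor_offsets == "Close":
        return "Close"
    return "Combined"


print(default_target_offsets(predictor_offsets="Close"))  # Close
```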
"targetOffsets": "Close"
Predictor offsets
The default setting is Common, which means that feature selection from predictors is done in batches by days. Setting it to Close causes each situation in the forecasting horizon to be trained individually with the closest possible offsets for predictors and target (if target and predictor offsets are allowed). However, this increases model building time. Close can be appropriate for a short prediction horizon if not all predictors are available during the entire prediction horizon. Options are:
- Common: models for situations within one day using only common predictor offsets, and
- Close: every model uses the closest predictor offsets possible.
"predictorOffsets": "Common"
Note: Due to the individual training, Close can affect models even if only the target variable is available.
Normalization
When normalization is on, predictors are scaled by their mean and standard deviation. Switching normalization off may help to model data with structural changes. If not provided or set to automatic, TIM will decide automatically.
"normalization": true
Model complexity
This setting determines the maximal possible number of terms in each model in the Model Zoo. Challenging datasets might require a lower model complexity. If not set, TIM will calculate the model complexity automatically based on the sampling period of the dataset.
"maxModelComplexity": 50
Features
TIM tries to enhance the model building process with new artificially created features derived from the original predictors. The following transformations are available (those used by default are listed in the Features row of the overall configuration table):
- Piecewise linear
- Periodic components
- Weekrest
- Day of week
- Intercept
- Polynomial
- Exponential moving average
- Simple moving average
- Time offsets
- Identity
- Fourier
- Trend
- Month
- Public holidays
- One-hot encoding
It is possible to change the selection of features TIM can use by explicitly sending a list of the features to use (potentially also omitting features that are by default included).
"features": ["TimeOffsets", "Identity", "PiecewiseLinear", "ExponentialMovingAverage",
"SimpleMovingAverage", "Periodic", "Fourier", "RestOfWeek", "DayOfWeek",
"PublicHolidays", "Month", "Trend", "Intercept", "Polynomial"]
Daily cycle
This setting is a boolean value determining whether or not to use an individual model building approach for different times within a day. Doing so is beneficial if the dynamics of the underlying problem change during the day. Switching it off leads to a common model building approach for all timestamps. If the parameter is not provided, TIM will decide automatically. Learn more about the importance of this parameter in the dedicated section on daily cycle.
"dailyCycle": false
Allow offsets
Allow offsets is a boolean value that determines whether to use offsets of predictors in the model. If allow offsets is set to false, no time offsets, exponential moving averages or simple moving averages will be used in the model; these features do not have to be explicitly deselected in the feature configuration. The piecewise linearity transformation will be made only from predictors that are available at the forecasted timestamp. If allow offsets is set to false, the explicit offset limit parameter cannot be set to anything other than 0. This setting applies to all predictors, including the target variable. Therefore, setting model quality to High, VeryHigh or Combined while setting allow offsets to false will return the same result as setting model quality to Medium. Calendar features may still occur in the model with offsets, since these are engine features and are obtained only from the forecasted timestamp.
"allowOffsets": false
Offset limit
Offset limit can be set as an explicit value; if it is not set, the value will be determined automatically. This value is a negative number defining how far into the past offsets can go. This setting is mainly used to generate time offsets. Only offsets from the range defined by the offset limit and the closest available offset of a variable will be considered in the model building process. The features exponential moving average, simple moving average and piecewise linearity will be calculated from a variable only if the closest available offset of the variable is closer to the dataset end than the offset limit. The features public holidays, weekrest and weekday will not be affected by this setting, since they are determined separately.
If allow offsets is set to false, the explicit offset limit cannot be set to anything other than 0. The offset limit that was used in model building can be found in the job log.
"offsetLimit": {
"type": "Explicit",
"value": -10
}
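The range of offsets considered for one variable can be sketched as follows. This is an illustration of the rules above, not TIM's implementation; the function name candidate_offsets is hypothetical, and offsets are assumed to be counted in samples with 0 denoting the forecasted timestamp and negative numbers denoting the past.

```python
def candidate_offsets(offset_limit, closest_available, allow_offsets=True):
    """Return the offsets of one variable considered during model building.

    offset_limit: how far into the past offsets may go (0 or negative).
    closest_available: closest offset at which the variable has data
        (assumed to be 0 or negative in this sketch).
    """
    if offset_limit > 0:
        raise ValueError("offsetLimit must not be positive")
    if not allow_offsets and offset_limit != 0:
        raise ValueError("offsetLimit must be 0 when allowOffsets is false")
    # Empty when the closest available offset lies beyond the limit.
    return list(range(offset_limit, min(closest_available, 0) + 1))


print(candidate_offsets(-10, -2))  # [-10, -9, -8, -7, -6, -5, -4, -3, -2]
```

With the explicit limit of -10 from the example above and data available up to two samples before the forecasted timestamp, nine offsets remain as candidates.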
Memory limit check
TIM tries to estimate whether the worker it currently operates on has enough memory to finish the model building and forecasting process. If it does not, and the memory limit check is turned on, TIM will drop some of the rows and columns of the dataset and turn off some of the transformations. The check is turned on by default. Turning it off may lead to a crash of the operation for big datasets.
"memoryLimitCheck": false
Rebuilding policy
The rebuilding policy controls which model(s) of the given parent job's Model Zoo should be rebuilt and which should be dropped. There are three different options:
- All: all models in the current Model Zoo are dropped, and new models are added;
- NewSituations: only models that are needed for the given forecasting horizon and that the current Model Zoo cannot handle are built and added to the Model Zoo;
- OlderThan: the same behavior as NewSituations, but models older than the given time are also deemed useless and replaced with newly built ones. This is the only option where it makes sense to include the time parameter. The user can specify any number of days, hours, quarter-hours or samples.
"rebuildingPolicy": {
"type": "OlderThan",
"time": {
"baseUnit": "Day",
"value": 7
}
}
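The three policies differ only in which models end up being built. The sketch below illustrates that difference under several assumptions: the Model Zoo is represented as a plain mapping from a situation label (such as S+1) to the time its model was built, the policy names use the casing of the JSON example, and the function name models_to_build is hypothetical.

```python
from datetime import datetime, timedelta


def models_to_build(policy, zoo, needed, now, max_age=None):
    """Return the situations for which a model is (re)built.

    zoo: mapping of situation label -> time the existing model was built.
    needed: situations required by the requested forecasting horizon.
    max_age: timedelta used by the OlderThan policy only.
    """
    if policy == "All":
        # The whole Model Zoo is dropped, so everything needed is built anew.
        return sorted(needed)
    missing = {s for s in needed if s not in zoo}
    if policy == "NewSituations":
        return sorted(missing)
    if policy == "OlderThan":
        stale = {s for s, built in zoo.items() if now - built > max_age}
        return sorted(missing | stale)
    raise ValueError("unknown rebuilding policy: " + policy)


now = datetime(2021, 1, 31)
zoo = {"S+1": datetime(2021, 1, 30), "S+2": datetime(2021, 1, 1)}
needed = ["S+1", "S+2", "S+3"]
print(models_to_build("OlderThan", zoo, needed, now, timedelta(days=7)))
```

In this example the model for S+2 is older than seven days and S+3 is not covered yet, so both are built, while the recent model for S+1 is kept.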
Prediction intervals
The prediction intervals setting expresses the uncertainty in prediction by creating an interval where the prediction should probably occur. The value of this setting expresses the probability that the prediction will be inside the symmetrical prediction interval. Therefore, with increasing value, the prediction intervals widen.
"predictionIntervals": 95
Prediction boundaries
For some datasets, values outside certain boundaries do not make sense - e.g. negative values for energy production. TIM tries to figure these out automatically, but there is an option to override these detected values. Both the lower and upper boundaries should be real values. It might be useful to turn prediction boundaries off for datasets with a visible trend.
"predictionBoundaries": {
"type": "Explicit",
"maxValue": 1000,
"minValue": 0
}
Rolling window
When TIM evaluates the models built on the in-sample and out-of-sample data, it starts rolling backwards from where the target variable ends until the start of the dataset and forecasts the whole length of the forecasting horizon each time. The user can specify the length of this rolling window to control the size of the output (using any number of days, hours, minutes, seconds or samples). By default, for daily cycle datasets a rolling window of 1 day is used and for nondaily cycle datasets a rolling window of 1 sample is used.
"rollingWindow": {
"baseUnit": "Day",
"value": 2
}
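The rolling mechanism can be pictured as a sequence of forecast starting points that moves backwards through the dataset in steps of the rolling window. The following sketch only illustrates that idea; the function name forecast_origins is hypothetical, and treating the end of the target variable as the first starting point is an assumption.

```python
from datetime import datetime, timedelta


def forecast_origins(data_start, last_target, window):
    """Return the timestamps from which forecasts are made, newest first."""
    origins = []
    origin = last_target
    while origin >= data_start:
        origins.append(origin)
        origin -= window  # roll backwards by one window
    return origins


origins = forecast_origins(
    datetime(2021, 1, 1), datetime(2021, 1, 5), timedelta(days=2)
)
print([o.day for o in origins])  # [5, 3, 1]
```

A longer rolling window produces fewer starting points and therefore a smaller output.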
Backtest
This setting determines which types of forecasts should be returned. The Production option only returns the production forecast, the OutOfSample option also produces out-of-sample forecasts, and the All option also delivers in-sample forecasts.
"backtest": "All"
Data configuration
Configuration parameter | build-model | rebuild-model | predict | default |
---|---|---|---|---|
In-sample rows | ☑ | ☑ | ☐ | All records except Out-of-sample |
Out-of-sample rows | ☑ | ☑ | ☑ | No records |
Imputation | ☑ | ☑ | ☑ | Linear for gaps no longer than 6 |
Columns | ☑ | ☑ | ☐ | all |
Target column | ☑ | ☐ | ☐ | First non-timestamp column |
Holiday column | ☑ | ☐ | ☐ | none |
Time scale | ☑ | ☐ | ☐ | Originally estimated from dataset |
Aggregation | ☑ | ☐ | ☐ | Mean (numerical variables) / Maximum (boolean variables) |
Alignment | ☑ | ☑ | ☑ | Determined from dataset end |
Preprocessors | ☑ | ☑ | ☑ | No preprocessors |
☑ available in a given method
☐ not available in a given method
In-sample rows
This setting defines which samples should be used for model building (training). The user can specify the in-sample timestamps as an array of timestamp ranges. If not set, all timestamps except the ones defined in the out-of-sample rows will be used.
"inSampleRows": [
{
"from": "2009-06-01 00:00:00",
"to": "2009-06-10 23:00:00"
},
{
"from": "2009-05-01 00:00:00",
"to": "2009-05-10 23:00:00"
}
]
Alternatively, a relative notation can be used, expressed as an integer number n with its base unit (one of Day, Hour, Minute, Second and Sample). This defines the length of the time range. The type of the relative range defines the start and the direction from which it is calculated. Last starts from the last non-missing target observation (the newest observation of the target variable) and goes backwards; First starts from the first non-missing target observation (the oldest observation) and goes forward. If no type is specified, the default is Last.
"inSampleRows": {
"type": "Last",
"baseUnit": "Day",
"value": 2
}
If there is an intersection of the inSampleRows with the outOfSampleRows, observations in the intersection will be considered as follows:
- by default, observations in the intersection will be considered as out-of-sample,
- when outOfSampleRows are defined as a relative range starting from the first target timestamp (type First), the observations in the intersection will be considered as inSample; the reasoning here is that for out-of-sample validation data towards the end of the dataset are more relevant.
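The two rules can be illustrated on row indices. This is a simplified sketch, not TIM's implementation: it assumes relative ranges given in samples and a dataset without missing target values, so that the first and last target observations are the first and last rows; the function names relative_range and split_rows are hypothetical.

```python
def relative_range(n_rows, value, range_type="Last"):
    """Return the row indices selected by a relative range of `value` samples."""
    if range_type == "Last":
        return set(range(max(n_rows - value, 0), n_rows))
    if range_type == "First":
        return set(range(min(value, n_rows)))
    raise ValueError("type must be First or Last")


def split_rows(in_sample, out_of_sample, out_type="Last"):
    """Resolve an overlap between in-sample and out-of-sample row indices."""
    overlap = in_sample & out_of_sample
    if out_type == "First":
        # Out-of-sample range anchored at the dataset start: overlap stays in-sample.
        out_of_sample = out_of_sample - overlap
    else:
        # Default: overlapping rows are treated as out-of-sample.
        in_sample = in_sample - overlap
    return sorted(in_sample), sorted(out_of_sample)


in_rows = relative_range(10, 6, "Last")   # rows 4..9
out_rows = relative_range(10, 3, "Last")  # rows 7..9
print(split_rows(in_rows, out_rows))      # ([4, 5, 6], [7, 8, 9])
```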
Out-of-sample rows
This setting defines which samples should be used to backtest (validate) the Model Zoo. These observations will not be used during model building (training), and therefore the forecasts' accuracy on this region more closely resembles that of the real production setup. If not set, none will be used.
There are two ways to configure the out-of-sample rows:
- as an array of timestamp ranges:
"outOfSampleRows": [
{
"from": "2020-06-01 00:00:00",
"to": "2020-06-10 23:00:00"
},
{
"from": "2020-05-01 00:00:00",
"to": "2020-05-10 23:00:00"
}
]
- as a relative range: an integer number n with a base unit (one of Day, Hour, Minute, Second and Sample) defines the length of the range, and the type defines its start and direction (First counts forward from the first non-missing target observation, Last counts backward from the last one; the default is Last).
"outOfSampleRows": {
"type": "Last",
"baseUnit": "Day",
"value": 2
}
If there is an intersection of the inSampleRows with the outOfSampleRows, observations in the intersection will be considered as follows:
- by default, observations in the intersection will be considered as out-of-sample,
- when outOfSampleRows are defined as a relative range starting from the first target timestamp (type First), the observations in the intersection will be considered as inSample; the reasoning here is that for out-of-sample validation data towards the end of the dataset are more relevant.
Imputation
The imputation setting applies if there are missing values in the dataset. Using this setting, TIM will impute all gaps in the data that are not longer than the maxLength parameter (given as a number of samples). There are two available imputation methods or types: Linear (for linear interpolation) and LOCF (for Last Observation Carried Forward, i.e. imputation with the last non-missing observation). The type None turns off imputation. The default setting is Linear with maxLength 6.
"imputation": {
"type": "Linear",
"maxLength": 1
}
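The effect of type and maxLength can be sketched on a plain list in which missing values are represented by None. This is an illustration only, not TIM's implementation; the function name impute is hypothetical, and leaving gaps at the very start of the data (and, for Linear, at the very end) untouched is an assumption.

```python
def impute(values, method="Linear", max_length=6):
    """Fill gaps (runs of None) that are not longer than max_length."""
    if method not in ("Linear", "LOCF", "None"):
        raise ValueError("unknown imputation type: " + method)
    result = list(values)
    if method == "None":
        return result
    n = len(result)
    i = 0
    while i < n:
        if result[i] is not None:
            i += 1
            continue
        start = i
        while i < n and result[i] is None:
            i += 1
        end = i  # first index after the gap
        gap = end - start
        if gap > max_length or start == 0:
            continue  # gap too long, or no earlier observation to use
        left = result[start - 1]
        if method == "LOCF":
            for k in range(start, end):
                result[k] = left
        elif end < n:
            # Linear interpolation between the neighbouring observations.
            step = (result[end] - left) / (gap + 1)
            for k in range(start, end):
                result[k] = left + step * (k - start + 1)
    return result


print(impute([1.0, None, None, 4.0], "Linear", 2))  # [1.0, 2.0, 3.0, 4.0]
print(impute([1.0, None, None, 4.0], "Linear", 1))  # [1.0, None, None, 4.0]
```

With maxLength 1, as in the configuration example above, the gap of two samples is longer than allowed and stays missing.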
Columns
This setting lists all columns (given either by their names or numbers) that should be used for model building. If not provided, TIM will use all available columns. The target column should always be included.
"columns": [5, "y"]
Target column
This setting defines the column (given either by its name or number) that contains the target variable.
"targetColumn": 2
Holiday column
This setting defines the column (given either by its name or number) that contains the holiday variable. If not provided, TIM will assume there is none provided.
"holidayColumn": 5
Time scale
This setting determines the rescaling of the original dataset to another sampling period. The baseUnit of the rescaling is limited to one of Day, Hour, Minute or Second. If not set, the originally estimated sampling period will be used. Time scaling only works from shorter sampling periods to longer sampling periods (for example, from hourly to daily data), and does not work for data sampled monthly.
"timeScale": {
"baseUnit": "Day",
"value": 2
}
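Rescaling to a longer sampling period means that several consecutive original samples are combined into one new sample by an aggregation function. The sketch below shows that idea on a plain list; it is not TIM's implementation. The function name aggregate and the grouping by a fixed number of samples (rather than by timestamps) are assumptions made for illustration, while the four aggregation names are the types available in the aggregation setting.

```python
from statistics import mean

AGGREGATIONS = {"Mean": mean, "Sum": sum, "Minimum": min, "Maximum": max}


def aggregate(values, factor, aggregation="Mean"):
    """Combine each group of `factor` consecutive samples into one sample."""
    func = AGGREGATIONS[aggregation]
    # The last group may contain fewer than `factor` samples.
    return [func(values[i:i + factor]) for i in range(0, len(values), factor)]


hourly = [1, 2, 3, 4, 5, 6]
print(aggregate(hourly, 3, "Sum"))      # [6, 15]
print(aggregate(hourly, 3, "Maximum"))  # [3, 6]
```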
Aggregation
This setting defines the aggregation function used for the target variable; predictor variables are always aggregated by the default aggregation function. Available aggregation types are Mean, Sum, Minimum and Maximum. The default aggregation is Mean for numerical variables and the Maximum for boolean variables. It is related to the time scale parameter, as the sampling period to aggregate to is defined there.
"aggregation": "Mean"
Alignment
The alignment setting makes it possible to set the alignment at the end of the dataset, which is useful for backtesting. It allows setting the timestamp of the last target observation (i.e., lastTargetTimestamp) from which the rolling window is applied and production forecasts are calculated. If not given, the last non-missing target timestamp from the original data is used. The last target timestamp cannot be lower than any out-of-sample record. The availability of all other variables (except the target) may be given relative to the last non-missing target timestamp. If the alignment is not provided for some variable, the alignment from the original data is taken. This means that the difference between the last non-missing timestamp of a variable in the data and the last non-missing target timestamp in the data is used. For more details check the data alignment section.
"alignment": {
"lastTargetTimestamp": "2021-01-31 00:00:00Z",
"dataUntil": [
{
"column": "Sales",
"baseUnit": "Hour",
"offset": -2
}
]
}
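The example above can be read as "Sales is available until two hours before the last target timestamp". The following sketch computes such availability timestamps; it is an interpretation of the setting, not TIM's implementation. The function name data_until_timestamps is hypothetical, and the base unit Sample is left out because it would require the sampling period of the dataset.

```python
from datetime import datetime, timedelta, timezone

UNITS = {
    "Day": timedelta(days=1),
    "Hour": timedelta(hours=1),
    "Minute": timedelta(minutes=1),
    "Second": timedelta(seconds=1),
}


def data_until_timestamps(alignment):
    """Return, per column, the timestamp up to which its data is available."""
    last_target = datetime.strptime(
        alignment["lastTargetTimestamp"], "%Y-%m-%d %H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        item["column"]: last_target + item["offset"] * UNITS[item["baseUnit"]]
        for item in alignment["dataUntil"]
    }


alignment = {
    "lastTargetTimestamp": "2021-01-31 00:00:00Z",
    "dataUntil": [{"column": "Sales", "baseUnit": "Hour", "offset": -2}],
}
print(data_until_timestamps(alignment)["Sales"])
```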
Preprocessors
This setting provides an array of filters and transformations that will be applied on the data in the given order. (Currently only one preprocessor is defined.)
"preprocessors": [
{
"type": "CategoryFilter",
"value": {
"column": "ColumnName_1",
"categories": [1, 2, 3]
}
}
]
Type | build-model | rebuild-model | predict | default |
---|---|---|---|---|
CategoryFilter | ☑ | ☑ | ☑ | All records |
☑ available in a given method
☐ not available in a given method
Category filter
The category filter filters the data to select only those rows with specified values - i.e. belonging to a specific category or set of categories. Currently, this filter is applied only on columns containing group keys. For more details check the documentation section about category filters. By default, all rows will be selected.
{
"type" : "CategoryFilter",
"value": {
"column": "ColumnName_1",
"categories": [1, 2, 3]
}
}
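The effect of the filter on tabular data can be sketched as follows. This is an illustration only, not TIM's implementation; rows are represented as dictionaries, and the function name apply_category_filter and the sample data are hypothetical.

```python
def apply_category_filter(rows, preprocessor):
    """Keep only the rows whose value in the given column is a listed category."""
    if preprocessor["type"] != "CategoryFilter":
        raise ValueError("unsupported preprocessor type")
    column = preprocessor["value"]["column"]
    categories = set(preprocessor["value"]["categories"])
    return [row for row in rows if row[column] in categories]


preprocessor = {
    "type": "CategoryFilter",
    "value": {"column": "ColumnName_1", "categories": [1, 2, 3]},
}
rows = [
    {"ColumnName_1": 1, "y": 10.0},
    {"ColumnName_1": 4, "y": 12.5},
    {"ColumnName_1": 3, "y": 9.0},
]
print(apply_category_filter(rows, preprocessor))
```

Only the first and third rows remain, because the value 4 is not among the listed categories.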