Input Data Properties
From the content of a dataset to its expected format, data must be handled correctly to get useful results from it. A few restrictions apply to ensure that a dataset’s contents are interpreted correctly. This section explains these restrictions; they apply regardless of which interface is used to work with TIM.
Dataset type
TIM supports two main types of datasets: single time series and panel datasets.
Time-series data
Time-series data is a collection of observations for an individual entity over time. Each dataset is considered as time-series data by default.
Timestamps in a time-series dataset have to be unique. If duplicate timestamps occur, the first occurrence in the data with corresponding observations is selected and all others are ignored.
The timestamp column cannot contain missing observations (i.e. observations with no timestamp value, yet with values for other columns; not to be confused with missing data or gaps). Such a row is invalid: it should be removed or, if possible, the missing timestamp should be filled in.
A dataset must contain a timestamp column and at least one variable, which can be used as the target variable.
Example
Timestamp | Sales | Holidays | Temperature |
---|---|---|---|
2022-01-01 | 11 | 1 | 11 |
2022-01-02 | 10 | 0 | 10 |
2022-01-03 | 16 | 0 | 12 |
2022-01-04 | 20 | 0 | 9 |
2022-01-05 | | 0 | 8 |
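The timestamp rules above can be mirrored in a check that runs before upload. The following is a minimal sketch in plain Python, not TIM's own implementation. The function name `clean_rows` and the row layout (a list of lists with the timestamp in the first position) are assumptions made for illustration. Timestamps are compared as raw strings, so two differently formatted spellings of the same instant would not be detected as duplicates.

```python
def clean_rows(rows):
    """Drop rows without a timestamp and keep the first row per timestamp.

    `rows` is assumed to be a list of lists of strings, timestamp first.
    """
    seen = set()
    cleaned = []
    for row in rows:
        timestamp = row[0].strip()
        if not timestamp:
            # No timestamp value: the row is invalid and is skipped.
            continue
        if timestamp in seen:
            # Duplicate timestamp: only the first occurrence is kept.
            continue
        seen.add(timestamp)
        cleaned.append(row)
    return cleaned


rows = [
    ["2022-01-01", "11", "1", "11"],
    ["2022-01-02", "10", "0", "10"],
    ["2022-01-02", "99", "0", "10"],  # duplicate timestamp, ignored
    ["", "16", "0", "12"],            # missing timestamp, invalid
]
print(clean_rows(rows))
```

Only the first two rows survive in this example.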
Panel data
Panel data is a collection of observations for multiple entities over time. This documentation will refer to the individual entities as groups and to the variables that split the data into different groups as group keys.
Panel data is a dataset with specified group keys. If no group keys are specified, the dataset is considered as single time-series data. Group keys can correspond to one or more columns. The columns with group keys must be defined when first uploading a dataset; it is essential to know which columns are group keys from the beginning, so the validation of a dataset can run correctly. There are slight differences between the validation of classical time-series data and panel data. Once set, the group keys property cannot be changed.
The JSON for specifying the group keys of a panel dataset to the TIM API, thus defining it as panel data, looks like this:
{
"groupKeys" : ["Store ID", "Category"]
}
Unlike classical time-series data, panel data may, and most often does, include duplicate timestamps. However, each group should contain only unique timestamps. A duplicate timestamp within a group is handled as it is for classical time series: the first occurrence of the timestamp in the group with corresponding data is selected, and all others are ignored.
A dataset must contain a timestamp column, group keys and at least one variable, which can be used as target variable.
Group key columns cannot contain missing observations. Such rows are invalid, because they cannot be assigned to any group. They should be removed or, if possible, the group key values should be filled in.
The sampling period of a panel dataset is calculated across the whole dataset with regard to groups. Individual groups should be sampled similarly.
Example
Store ID and Category represent group keys.
Store ID | Category | Timestamp | Sales | Holidays | Temperature | Store Size |
---|---|---|---|---|---|---|
1 | Food | 2022-01-01 | 11 | 1 | 11 | 25 |
1 | Food | 2022-01-02 | 10 | 0 | 10 | 25 |
1 | Food | 2022-01-03 | | 0 | 12 | 25 |
1 | Household | 2022-01-01 | 20 | 1 | 11 | 25 |
1 | Household | 2022-01-02 | 22 | 0 | 10 | 25 |
1 | Household | 2022-01-03 | | 0 | 12 | 25 |
2 | Food | 2022-01-01 | 40 | 1 | 13 | 100 |
2 | Food | 2022-01-02 | 43 | 0 | 9 | 100 |
2 | Food | 2022-01-03 | | 0 | 10 | 100 |
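To make the panel rules concrete, the sketch below splits records into groups by their group keys, skips rows whose group key is missing, and keeps the first row per timestamp within each group. It is an illustration under stated assumptions, not TIM's validation code. The function name `split_into_groups`, the dictionary-based record layout and the case-insensitive comparison against the missing-value strings are all hypothetical choices.

```python
# Strings treated as missing group key values (compared case-insensitively
# here, which is an assumption of this sketch).
MISSING_MARKERS = {"", "na", "nan", "n/a", "missing", "null", "none", "nothing"}


def split_into_groups(records, group_keys, timestamp_column="Timestamp"):
    """Return {group key tuple: [records]} with unique timestamps per group."""
    groups = {}
    for record in records:
        key = tuple(str(record.get(name, "")).strip() for name in group_keys)
        if any(part.lower() in MISSING_MARKERS for part in key):
            # The row cannot be assigned to any group, so it is skipped.
            continue
        series = groups.setdefault(key, {})
        # setdefault keeps the first record seen for a given timestamp.
        series.setdefault(record[timestamp_column], record)
    return {key: list(series.values()) for key, series in groups.items()}


records = [
    {"Store ID": 1, "Category": "Food", "Timestamp": "2022-01-01", "Sales": 11},
    {"Store ID": 1, "Category": "Household", "Timestamp": "2022-01-01", "Sales": 20},
    {"Store ID": 1, "Category": "Food", "Timestamp": "2022-01-01", "Sales": 99},  # duplicate in group
    {"Store ID": 2, "Category": "", "Timestamp": "2022-01-01", "Sales": 40},      # missing group key
]
groups = split_into_groups(records, ["Store ID", "Category"])
for key, rows in groups.items():
    print(key, rows)
```

The same timestamp appears in both the Food and Household groups of store 1, which is valid; only the repeated Food row and the row without a Category are dropped.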
Dataset size
The dataset size shouldn't exceed 100 MiB. The table below gives some rough estimates of what this means in terms of observations (rows) and variables (columns, except the timestamp column).
Observations | Variables |
---|---|
4 000 000 | 1 |
1 300 000 | 10 |
170 000 | 100 |
17 000 | 1000 |
Note: This table assumes the timestamp format yyyy-mm-dd HH:MM:SS and values written with 4 digits (e.g. 0.582).
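The estimates in the table can be roughly reproduced with simple arithmetic. The sketch below assumes a plain-text, CSV-like encoding with one byte per character: 19 characters for the timestamp, 5 characters per value (such as 0.582), one separator before each value and one newline per row. These byte counts are assumptions for illustration; the actual storage cost depends on the file format and the values.

```python
SIZE_LIMIT_BYTES = 100 * 1024 * 1024  # 100 MiB

TIMESTAMP_CHARS = 19  # "yyyy-mm-dd HH:MM:SS"
VALUE_CHARS = 5       # e.g. "0.582"
SEPARATOR_CHARS = 1   # one delimiter before each value (assumed)
NEWLINE_CHARS = 1     # one line terminator per row (assumed)


def max_rows(n_variables):
    """Estimate how many rows fit within the size limit."""
    row_bytes = (
        TIMESTAMP_CHARS
        + n_variables * (SEPARATOR_CHARS + VALUE_CHARS)
        + NEWLINE_CHARS
    )
    return SIZE_LIMIT_BYTES // row_bytes


for n_variables in (1, 10, 100, 1000):
    print(n_variables, max_rows(n_variables))
```

Under these assumptions the estimate gives about 4.03 million rows for 1 variable, 1.31 million for 10, 169 thousand for 100 and 17.4 thousand for 1000, which is close to the rounded figures in the table.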
Number of observations
In general, most datasets with a higher sampling rate should not be modeled with more than 2 years of data. More data rarely improves accuracy and can sometimes even be detrimental when underlying dependencies change over time.
Number of columns
TIM supports uploading a maximum of 1024 columns; this number includes the timestamp column, group key columns, and all variable columns. This limit is a database limit, not an algorithmic limit.
When the dataset is first uploaded, the header should contain all the columns, even if values for some of them are missing at that time. When the dataset is later updated, the structure must remain the same: every column must be present in each updated row, including columns whose values are not changing, for which the original values must be provided again. If there are no known values for some columns, those columns should still be sent and left empty. Missing values within the range of data sent as an update are treated as missing values in the new version of the dataset, potentially overwriting previous values.
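One way to keep the structure stable when preparing an update is to build every row from the original header and leave unknown columns empty. The sketch below shows this idea; the function name `build_update_row` and the dictionary of known values are hypothetical, and the code does not call any TIM interface.

```python
MAX_COLUMNS = 1024  # timestamp + group key columns + variable columns


def build_update_row(header, known_values):
    """Return a full row in header order, with "" for columns without a value."""
    if len(header) > MAX_COLUMNS:
        raise ValueError(f"{len(header)} columns exceed the limit of {MAX_COLUMNS}")
    unknown = set(known_values) - set(header)
    if unknown:
        raise ValueError(f"columns not in the original header: {sorted(unknown)}")
    # Columns without a known value are sent as empty strings.
    return [str(known_values.get(column, "")) for column in header]


header = ["Timestamp", "Sales", "Holidays", "Temperature"]
row = build_update_row(
    header, {"Timestamp": "2022-01-05", "Holidays": 0, "Temperature": 8}
)
print(row)
```

Because an empty value in an update can overwrite a previously stored value, a caller that wants to preserve existing data would need to pass the original values in `known_values` rather than leave them out.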
Timestamps
To indicate the nature of the data (time series), every single observation should be connected to exactly one timestamp. For panel data, a single observation pertains to a single group, and duplicate timestamps can thus occur across groups, but not within groups. These timestamps usually correspond to the first column of the dataset and are by default assumed to be in the UTC timezone. Both the column and the timezone can however be set differently.
{
"timestampColumn" : "Timestamp",
"timeZoneName": "Europe/Bratislava"
}
The TIM Platform is ISO 8601 compliant. The formatting of timestamps is an important topic for time-series analysis and thus has its own section in the documentation that explains accepted formats in more detail.
Note: Only timestamps later than or equal to "0001-01-01 00:00:00" are supported.
Sampling rate and sampling period
A sampling rate is defined as the number of samples or observations in equidistant (sampled at a constant rate) time series per unit of time. Conversely, the sampling period is defined as the time difference between two consecutive samples or observations of equidistant time series.
Once a dataset is uploaded, TIM will try to estimate the dataset's original sampling period. This will always be one of the following time differences:
- 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 seconds
- 1, 2, 3, 4, 5, 6, 10, 12, 15, 20 or 30 minutes (expressed in seconds)
- 1, 2, 3, 4, 6, 8 or 12 hours (expressed in seconds)
- any number of days (expressed in seconds)
- any number of months (expressed in months)
TIM will try to determine the best fit based on the median distance between consecutive observations.
This doesn't mean that the data is stored differently from how it was uploaded. However, for forecasting applications, the original sampling period is used to rescale the data by default. Forecasting applications always require an equidistant distribution of timestamps, although missing data is still allowed. This means that if the data is, for example, recorded irregularly a number of times per second, TIM will internally convert the dataset to a 1 second resolution and build models that forecast with a 1 second resolution as well. If the dataset is recorded every 27 minutes, TIM Forecasting will use a version of the dataset that has a 30 minute resolution instead.
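The estimation described above can be approximated as follows. This is a sketch under assumptions, not TIM's actual algorithm: it takes the median gap between consecutive timestamps and snaps it to the nearest supported period, whereas the exact fitting rule used by TIM is not specified here. Month-based periods are left out for brevity, and the function name `estimate_sampling_period` is hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

STEPS = [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30]  # allowed seconds and minutes
HOUR_STEPS = [1, 2, 3, 4, 6, 8, 12]
DAY = 86400

# All supported sub-day periods plus one day, expressed in seconds.
CANDIDATES = (
    STEPS
    + [60 * minutes for minutes in STEPS]
    + [3600 * hours for hours in HOUR_STEPS]
    + [DAY]
)


def estimate_sampling_period(timestamps):
    """Return an estimated sampling period in seconds (months not handled)."""
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("at least two timestamps are needed")
    typical = median(gaps)
    if typical >= DAY:
        # Any whole number of days is allowed.
        return round(typical / DAY) * DAY
    # Assumed rule: pick the supported period closest to the median gap.
    return min(CANDIDATES, key=lambda candidate: abs(candidate - typical))


start = datetime(2022, 1, 1)
every_27_minutes = [start + timedelta(minutes=27 * i) for i in range(10)]
print(estimate_sampling_period(every_27_minutes))
```

With the nearest-period rule, a 27 minute gap (1620 seconds) is closer to 30 minutes than to 20 minutes, so the example yields 1800 seconds, and sub-second gaps resolve to 1 second, in line with the two cases described in the text.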
If fewer than two rows of data are uploaded during the first upload, the sampling period should be provided.
Data types
There are two main data types supported in TIM: numerical and categorical. Numerical data consists of values that can be measured and represented using numbers, such as age or height. Categorical data, on the other hand, consists of values that represent categories or groups, such as gender or color. Categorical data can be further divided into group keys, boolean variables, and the rest of the categorical variables.
Numerical variables
Numerical variables refer to data that can be represented by numbers and can be measured or quantified. This type of data includes variables that are continuous, such as temperature or pressure, as well as variables that are discrete, such as the number of people in a household or the number of items sold.
In TIM, numerical variables are fully supported and can be used in all modules, including TIM Detect and TIM Forecast. TIM provides various tools and techniques for analyzing and modeling numerical data, making it a valuable tool for time-series analysis and forecasting.
Group keys
Group keys are special cases of categorical data relevant for panel data. Unique combinations of values of the group keys split panel data into smaller groups; each group forms individual time-series data. Group keys can be represented as Strings or Integers and cannot contain missing values. Rows with missing group key values will not be stored, since such rows cannot be assigned to any group. The strings "na", "nan", "n/a", "missing", "null", "none", "nothing" are ignored, as these cases are considered as missing values.
Boolean variables
TIM DB provides support for special subcases of categorical data, specifically Boolean variables with only two possible values (0/1 or True/False). TIM provides extensive support for this case, including the automatic creation of features and the ability to use it as the target variable in classification tasks.
Categorical variables
In addition to group keys and Boolean variables, TIM also supports other types of categorical variables.
When uploading a dataset, categorical variables can be specified by listing them explicitly. If categorical variables are not provided, automatic detection will be run. By default, any column containing at least one non-missing string value will be considered a categorical variable.
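The default detection rule can be sketched as below. This is an illustration only: the helper names are hypothetical, and the treatment of the text values True and False as Boolean rather than as strings, as well as the case-insensitive matching of missing-value strings, are assumptions of this sketch rather than documented behavior.

```python
MISSING_MARKERS = {"", "na", "nan", "n/a", "missing", "null", "none", "nothing"}
BOOLEAN_TOKENS = {"true", "false"}  # assumed to count as Boolean, not string


def is_number(text):
    """Return True when the text can be parsed as a float."""
    try:
        float(text)
    except ValueError:
        return False
    return True


def detect_categorical(columns):
    """Return names of columns holding at least one non-missing string value.

    `columns` is assumed to map column names to lists of raw string values.
    """
    categorical = []
    for name, values in columns.items():
        for raw in values:
            text = raw.strip()
            lowered = text.lower()
            if lowered in MISSING_MARKERS or lowered in BOOLEAN_TOKENS:
                continue
            if not is_number(text):
                categorical.append(name)
                break
    return categorical


columns = {
    "Sales": ["11", "na", "16"],
    "Weather": ["sunny", "", "rain"],
    "Holiday": ["True", "False", "0"],
}
print(detect_categorical(columns))
```

In this example only the Weather column is reported as categorical.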
There are some restrictions on using categorical data in TIM. TIM Detect does not support categorical data; only numerical and Boolean variables should be used when working with this module. TIM Forecast supports categorical data as predictors or features, but a categorical variable cannot be set as the target variable for classification.
Variables
Target or KPI variable
Each dataset to be analysed should contain exactly one target or KPI variable: the variable to forecast or to run detection on. The observations of this variable usually correspond to the second column of the dataset. As with the timestamp column, a user can indicate a different target or KPI column.
Explanatory variables or predictors
TIM supports multivariate time series analysis. This means that, if desired, more variables with potential explanatory power can be added to enhance modeling results. Any remaining dataset columns (i.e. any columns except the timestamp, group key and target or KPI columns) can contain these potential explanatory variables or predictors. TIM will only take into account those variables that are relevant for a specific modeling use case; however, a user can override this default behavior and exclude specific variables or variable transformations.
Predictors and their forecasts
In some applications, there are predictors for which the values can be "known" in advance - their forecasts have been made. (Examples include binary variables indicating public holidays, or meteorological variables that have been forecasted.) The quality of these forecasts tends to vary across datasets, as well as in a single dataset across measuring instruments. Meteorological predictors, for example, can be of largely varying quality depending on the instruments used to collect their values. Predictors like these have both historical actuals and forecasts. Because of this variation in quality, it can be beneficial to take into account which observations contain historical actuals and which contain predictor forecasts and to potentially treat them differently. In general, it's preferable to build your models using historical actuals and then use the predictor forecasts for model evaluation in backtesting and production. TIM can handle such variables as it supports varying data availability per variable and generates models that are aware of the situation they have been built in.
Missing data or gaps
The strings "", "na", "nan", "n/a", "missing", "null", "none", "nothing" are ignored by TIM, and thus these cases are considered as missing values for categorical variables (currently only group keys are supported).
For other types of variables, TIM tries to parse every value to a float. If it cannot, it will consider the value to be missing. Examples of values interpreted as missing are null values, categorical (non-Boolean) values, NA strings and infinity markers. This is not a problem though, as TIM can handle missing data during model building, and also offers multiple ways to impute missing data.
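The float-parsing rule can be illustrated with a short sketch. It is not TIM's parser: the function name `parse_numeric` is hypothetical, and Python's own `float` parsing is used as a stand-in, so edge cases may differ. Boolean text such as True or False is outside the scope of this sketch.

```python
import math


def parse_numeric(raw):
    """Return a finite float, or None when the value counts as missing."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Null values, NA strings and other non-numeric text end up here.
        return None
    # NaN and infinity parse as floats in Python, so they are filtered out.
    return value if math.isfinite(value) else None


for raw in ["0.582", "12", "", "na", None, "Food", "inf", "nan"]:
    print(repr(raw), "->", parse_numeric(raw))
```

Only the first two values in the example produce a number; all the others are treated as missing.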