Size your Infrastructure for a TIM 5.0 Implementation
Introduction
This paper describes the capacity planning methodologies available for TIM InstantML and explains the calculations used to obtain metrics for estimating and sizing a TIM Engine environment.
The TIM InstantML technology can be deployed in various ways. TIM InstantML can be deployed as a SaaS service that is called from your IT environment.
Alternatively, TIM InstantML can be deployed in an On-Premise environment where you provide the server infrastructure, or in a Bring Your Own License (BYOL) setup on a cloud of your choice (Azure, AWS, ...).
This document describes sizing considerations for an On-Premise or BYOL environment.
In the SaaS scenario, scaling of the service is done automatically. The BYOL/On-Premise solution also provides scaling, but you will need to provide sufficient resources.
Related Software Versions
This paper relates to the following versions of the TIM software:
- TIM Engine 5.X
- TIM Studio 5.X
Architectural Components
TIM InstantML runs mainly on a Kubernetes cluster.
As an example, we provide an Azure Deployment Scheme:
- Scalability Fabric TIM Engine with queuing - AKS Cluster with D3 v2 VMs. This is the Kubernetes Cluster Service implementation by Azure.
- Database - Azure Database for PostgreSQL. This is an Azure Database Service.
- TIM Workers - ACI for TIM Worker instances. This is the fast-scaling Azure Container Instances service (or Kubernetes).
Typically, at least two of these services will be set up for redundancy.
The number of TIM Workers is scaled up depending on the number of requests you send to the TIM Engine and therefore the number of requests in the queue.
In an On-Premise environment, local Kubernetes and PostgreSQL installations are used.
In other cloud environments, the corresponding services are used. As an example, on AWS:
- Scalability Fabric TIM Engine and TIM Workers - EKS Cluster with m5.xlarge VMs. This is the Kubernetes Cluster Service implementation by AWS.
- Database - PostgreSQL
- Queuing of TIM Engine tasks - Amazon MQ for RabbitMQ
Difficulties in Sizing the Environment
The following elements determine the CPU time and memory requirements for building a model or producing a forecast, classification or anomaly detection:
- The size of the data structure
- The number of predictors (columns)
- The number of timestamps (rows)
- The predictor feature importance
- Correlation between the predictor candidates and target.
This makes it difficult to provide an algorithm that gives you the exact memory and CPU consumption. Instead, this document provides benchmarking figures that make it easy to size the architecture using benchmark data rather than rock-solid calculations.
Capacity Planning And Performance Overview
Data Input Size Considerations
The lightning-fast speed of TIM InstantML is the result of efficient in-memory processing and parallelization of computation. The default maximum memory usage corresponds to a dataset of 100 MB (measured in CSV format). Check out Data Properties for more details.
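To verify that a dataset stays under this default limit before uploading it, you can estimate its CSV size up front. The following is a minimal sketch assuming your data lives in a pandas DataFrame; the helper names are illustrative and not part of the TIM API.

```python
import io

import pandas as pd

MAX_CSV_SIZE_MB = 100  # default TIM InstantML dataset limit, measured as CSV size


def csv_size_mb(df: pd.DataFrame) -> float:
    """Estimate the size of a DataFrame when serialized to CSV."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    return len(buffer.getvalue().encode("utf-8")) / (1024 * 1024)


def fits_default_limit(df: pd.DataFrame) -> bool:
    """Check whether the dataset fits under the default 100 MB CSV limit."""
    return csv_size_mb(df) <= MAX_CSV_SIZE_MB
```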
Memory Usage
The only noteworthy objects that require significant space are:
- dataset
- forecasting / detection tabular output
- root cause analysis output
There are many more objects involved in the process, such as the model, logs, accuracy metrics and others, however they all require less than 1 MB of space. The tabular output increases in size significantly with a higher forecasting horizon and a smaller rolling window in case of forecasting, however this is only relevant in "backtesting" scenarios where users try different settings on historical data. In production setups, the tabular output diminishes to kilobytes because only the new timestamps are evaluated. The same goes for the root cause analysis output. All in all, the only memory-intensive object in a real production pipeline remains the dataset itself.
Processing Time
There are 2 significant bottlenecks in the whole forecasting / detection process:
- dataset upload / update
- model building / rebuilding
There are usually many more steps in the whole pipeline, however they require little to no time to process. This includes the model evaluation (the forecasting / detection itself) - once the model is ready, generating forecasts / detections is lightning fast (under a second). That is why we restrict the benchmarking times to the model building.
Benchmark Data and Scenarios
In most cases the sizing calculation is straightforward.
A typical TIM Worker runs on the following configuration:
- CPU: 4 virtual CPU cores
- Memory: 12 GB of RAM
In this benchmark we provide performance data on a single TIM Worker instance for different dataset sizes.
Benchmark Results
In the following tables you can find the processing response time and the CPU load created by the request, based on one TIM Worker (running on one 4-core CPU). The datasets were already uploaded before the benchmark started.
We provide benchmarks for two forecasting endpoints in different situations:
Case | Request type | New model | Backtesting |
---|---|---|---|
1 | forecasting/forecast-jobs/build-model | yes | yes |
2 | forecasting/forecast-jobs/{id}/rebuild-model | yes | yes |
3 | forecasting/forecast-jobs/{id}/rebuild-model | no | yes |
4 | forecasting/forecast-jobs/{id}/rebuild-model | no | no |
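As a rough illustration of how these endpoints might be called, the sketch below uses Python's requests library. The base URL, authentication header and payload fields are assumptions for illustration only and may differ from the API schema of your TIM installation.

```python
import requests

# Assumed values - replace with those of your TIM installation.
TIM_URL = "https://tim.example.com/api/v5"
HEADERS = {"Authorization": "Bearer <your-api-key>", "Content-Type": "application/json"}

# Case 1: build a new model for an already uploaded dataset (payload fields are illustrative).
build_response = requests.post(
    f"{TIM_URL}/forecasting/forecast-jobs/build-model",
    headers=HEADERS,
    json={"useCase": {"id": "<use-case-id>"}},
)
job_id = build_response.json().get("id")  # assumes the job id is returned in the response body

# Cases 2-4: rebuild an existing forecasting job; whether new models are built
# depends on the configuration and data, as shown in the table above.
rebuild_response = requests.post(
    f"{TIM_URL}/forecasting/forecast-jobs/{job_id}/rebuild-model",
    headers=HEADERS,
    json={},
)
print(build_response.status_code, rebuild_response.status_code)
```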
Forecasting and Classification Jobs
This benchmark was done for forecasting jobs that build models for 1-sample-ahead forecasts. The benchmark is related to the forecasting execution request. There are different job types (build and rebuild), however they always call the same core underneath. The benchmark result doesn't differ by request type per se; it differs by the number of models that have to be built. Imagine you call your build request first for the S+1 to S+3 horizon and then rebuild the same Model Zoo for the S+1 to S+6 horizon - in both cases, only three models are added to the Model Zoo, so the benchmark stays the same - slightly less than 3 times the respective number in the tables provided (the benchmark is only for the 1-model Model Zoo and the scaling is less than linear). The tables always show the number of rows on the y-axis and the number of variables (target variable and predictors) on the x-axis.
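To make the model-counting arithmetic from the example above explicit, here is a small illustrative sketch; the one-model-per-horizon-step assumption follows that example and is not a general TIM guarantee.

```python
# Build request for horizon S+1 .. S+3: the Model Zoo starts empty.
existing_models = 0
build_horizon = 3
models_added_by_build = build_horizon - existing_models      # 3 new models

# Rebuild request for horizon S+1 .. S+6: models for S+1 .. S+3 already exist.
rebuild_horizon = 6
models_added_by_rebuild = rebuild_horizon - build_horizon    # again 3 new models

# Both calls add three models, so the expected build time is roughly
# (slightly less than) 3x the single-model benchmark figure for that dataset size.
print(models_added_by_build, models_added_by_rebuild)
```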
Dataset size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 8192 bytes | 40 kB | 40 kB | 72 kB | 120 kB | 160 kB | 424 kB | 824 kB |
1000 | 88 kB | 120 kB | 160 kB | 472 kB | 920 kB | 1360 kB | 4024 kB | 8024 kB |
10000 | 616 kB | 936 kB | 1336 kB | 4472 kB | 8920 kB | 13 MB | 39 MB | 78 MB |
100000 | 5912 kB | 9120 kB | 13 MB | 43 MB | 87 MB | 130 MB | N/A | N/A |
1000000 | 57 MB | 89 MB | 128 MB | N/A | N/A | N/A | N/A | N/A |
Size of CSV file | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 2.6 kB | 4.9 kB | 7.8 kB | 30.9 kB | 59.9 kB | 88.9 kB | 291.9 kB | 581.7 kB |
1000 | 25.7 kB | 48.4 kB | 77.1 kB | 307.5 kB | 595.2 kB | 883.0 kB | 2.8 MB | 5.6 MB |
10000 | 256.5 kB | 482.9 kB | 770.3 kB | 3.0 MB | 5.8 MB | 8.6 MB | 28.3 MB | 56.4 MB |
100000 | 2.5 MB | 4.7 MB | 7.5 MB | 30.0 MB | 58.1 MB | 86.2 MB | N/A | N/A |
1000000 | 25.0 MB | 47.2 MB | 75.2 MB | N/A | N/A | N/A | N/A | N/A |
Cases 1 and 2
As described above, what influences the benchmark results is the number of new models that TIM builds. Therefore cases 1 and 2 are merged into one benchmark. In both cases, one model is built and an in-sample forecast together with the production forecast is calculated.
Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 23 | 30 | 30 | 32 | 32 | 32 | 34 | 33 |
1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
10000 | 60 | 80 | 120 | 129 | 133 | 137 | 144 | 200 |
100000 | 165 | 176 | 215 | 338 | 372 | 380 | N/A | N/A |
1000000 | 149 | 192 | 198 | N/A | N/A | N/A | N/A | N/A |
Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 46 | 46 | 46 | 46 | 47 | 47 | 47 | 47 |
1000 | 47 | 47 | 47 | 48 | 49 | 49 | 49 | 50 |
10000 | 52 | 53 | 53 | 57 | 57 | 58 | 58 | 58 |
100000 | 52 | 53 | 53 | 67 | 67 | 67 | N/A | N/A |
1000000 | 53 | 60 | 60 | N/A | N/A | N/A | N/A | N/A |
Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 0.04 | 0.04 | 0.05 | 0.09 | 0.11 | 0.2 | 0.5 | 0.8 |
1000 | 0.2 | 0.2 | 0.4 | 0.6 | 0.9 | 1.0 | 2.0 | 5 |
10000 | 2.5 | 3 | 4 | 10 | 13 | 14 | 32 | 61 |
100000 | 21 | 25 | 42 | 102 | 131 | 156 | N/A | N/A |
1000000 | 329 | 347 | 384 | N/A | N/A | N/A | N/A | N/A |
Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 0.4 | 0.5 | 0.5 | 0.8 | 1.4 | 1.7 | 5 | 7 |
1000 | 0.5 | 0.6 | 0.8 | 1.6 | 2.5 | 2.9 | 8 | 17 |
10000 | 3.8 | 3.8 | 5 | 14 | 20 | 25 | 65 | 133 |
100000 | 25 | 30 | 50 | 135 | 195 | 262 | N/A | N/A |
1000000 | 364 | 405 | 471 | N/A | N/A | N/A | N/A | N/A |
Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB |
1000 | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB |
10000 | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB |
100000 | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | N/A | N/A |
1000000 | 93.5 MB | 93.5 MB | 93.5 MB | N/A | N/A | N/A | N/A | N/A |
Case 3
In this case, no new situation is detected and no new model is built. An out-of-sample forecast (out-of-sample rows have to be set) and a production forecast are calculated.
Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 23 | 30 | 30 | 32 | 32 | 32 | 33 | 33 |
1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
10000 | 60 | 80 | 101 | 113 | 119 | 144 | 100 | 86 |
100000 | 100 | 127 | 157 | 130 | 108 | 101 | N/A | N/A |
1000000 | 111 | 118 | 255 | N/A | N/A | N/A | N/A | N/A |
Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
1000 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
10000 | 34 | 34 | 35 | 34 | 34 | 34 | 34 | 34 |
100000 | 39 | 43 | 53 | 45 | 41 | 42 | N/A | N/A |
1000000 | 52 | 58 | 62 | N/A | N/A | N/A | N/A | N/A |
Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 0.03 | 0.04 | 0.04 | 0.05 | 0.07 | 0.08 | 0.24 | 0.4 |
1000 | 0.05 | 0.06 | 0.07 | 0.14 | 0.23 | 0.31 | 1.1 | 2.4 |
10000 | 0.4 | 0.6 | 0.7 | 3.2 | 4.5 | 6.4 | 19 | 37 |
100000 | 7 | 8 | 11 | 29 | 49 | 69 | N/A | N/A |
1000000 | 288 | 309 | 343 | N/A | N/A | N/A | N/A | N/A |
Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 0.4 | 0.4 | 0.5 | 0.7 | 1.2 | 1.6 | 5 | 6 |
1000 | 0.4 | 0.4 | 0.5 | 1.1 | 2.0 | 2.8 | 7.3 | 15 |
10000 | 0.9 | 1.4 | 2.0 | 6 | 12 | 16 | 50 | 105 |
100000 | 10 | 14 | 20 | 76 | 114 | 184 | N/A | N/A |
1000000 | 326 | 368 | 433 | N/A | N/A | N/A | N/A | N/A |
Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB | 9.7 kB |
1000 | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB | 95.8 kB |
10000 | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB | 957 kB |
100000 | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | 9.4 MB | N/A | N/A |
1000000 | 93.5 MB | 93.5 MB | 93.5 MB | N/A | N/A | N/A | N/A | N/A |
Case 4
No new situation is detected and no new model is built. Only the production forecast is calculated.
Max CPU usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 23 | 30 | 30 | 32 | 32 | 32 | 34 | 33 |
1000 | 28 | 30 | 34 | 34 | 34 | 34 | 34 | 34 |
10000 | 34 | 34 | 34 | 35 | 36 | 36 | 36 | 47 |
100000 | 45 | 45 | 45 | 45 | 49 | 60 | N/A | N/A |
1000000 | 45 | 45 | 56 | N/A | N/A | N/A | N/A | N/A |
Max RAM usage in % | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
1000 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
10000 | 34 | 34 | 35 | 34 | 34 | 34 | 34 | 34 |
100000 | 39 | 43 | 53 | 45 | 41 | 42 | N/A | N/A |
1000000 | 52 | 58 | 62 | N/A | N/A | N/A | N/A | N/A |
Model building and prediction time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 0.02 | 0.02 | 0.02 | 0.03 | 0.04 | 0.13 | 0.2 | 0.5 |
1000 | 0.02 | 0.02 | 0.03 | 0.10 | 0.19 | 0.4 | 1.0 | 2.4 |
10000 | 0.04 | 0.2 | 0.3 | 1.6 | 3.0 | 4.6 | 16 | 33 |
100000 | 0.4 | 2 | 2 | 15 | 33 | 52 | N/A | N/A |
1000000 | 5 | 17 | 22 | N/A | N/A | N/A | N/A | N/A |
Total execution time in seconds | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 0.3 | 0.4 | 0.4 | 1.0 | 1.6 | 2.3 | 7 | 9 |
1000 | 0.3 | 0.4 | 0.5 | 1.3 | 2.3 | 3.5 | 11 | 17 |
10000 | 0.8 | 0.9 | 1.4 | 6 | 11 | 17 | 53 | 110 |
100000 | 2.5 | 7 | 7 | 47 | 99 | 152 | N/A | N/A |
1000000 | 26 | 61 | 107 | N/A | N/A | N/A | N/A | N/A |
Forecasting result table size in DB | 1 | 5 | 10 | 50 | 100 | 150 | 500 | 1000 |
---|---|---|---|---|---|---|---|---|
100 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB |
1000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 95.8 kB | 95.8 kB |
10000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 957 kB | 957 kB |
100000 | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | 0.1 kB | N/A | N/A |
1000000 | 0.1 kB | 0.1 kB | 0.1 kB | N/A | N/A | N/A | N/A | N/A |
Remarks
- The data used 4-digit precision (e.g. 0.582).
- Some datasets exceed 100 MB of storage in the database because DB indices are included.
- The computing time increases with bigger forecasting horizons, however, the increase is smaller than linear.
- The output table size is only relevant for backtesting tasks. It scales up with bigger forecasting horizon and down with bigger rolling window.
- As the memory usage approaches 100 percent, TIM starts to preprocess the data by throwing rows away and switching off features, which results in smaller numbers all across the tables after that breaking point (RAM, CPU and forecasting output table size). This is why the numbers may not always rise along the axis directions. The numbers where such preprocessing took place are denoted in italic; in this benchmark only polynomial features were switched off.
- Some fields are not filled because the respective dataset would be bigger than the 100 MB threshold.
- The CPU load is expressed per core, i.e. 4 cores × 100% per core = 400% maximum.
- The performance figures are for sequential execution of the ML requests without scaling and spinning up more TIM Workers.
Scaling the Workers
What do you do if you need more transactions per hour?
The TIM Engine provides queueing and automatically spins up new TIM Workers to cater for the volume of requests being handled.
How to calculate the size and pricing of your infrastructure?
The benchmark figures give you an indication of the performance you can expect in your use case. You need to determine the profile of the ML requests you will issue and calculate the number of TIM Workers you will need.
In this table, we give an example of a calculation:
Component | Sizing Consideration | Cost | Costing Example |
---|---|---|---|
TIM Engine Fabric | This Kubernetes-installed component ensures a REST endpoint is available. | We recommend 2 VMs with 4-core CPUs and 32 GB of memory for this. | 140 Euro / Month for 2 D3 servers to support the cluster |
Queueing Service | RabbitMQ is available as a Kubernetes cluster deployment - alternatively you can use a platform service for this. | RabbitMQ services are available on AWS and Azure. | Optional |
Database | This is the PostgreSQL database service. | Azure Database for PostgreSQL | 130 Euro / Month |
TIM Workers | The TIM Workers are the scalable component. | You can find the CPU load and response time in the benchmark tables; this allows you to calculate the number of 4-core/12 GB servers you need. | 2 ACI containers for TIM Workers - 240 Euro / Month |
Total | | | 510 Euro / Month |
This is a two-TIM-Worker configuration. Some example throughputs:
- RTInstantML scenario (/forecasting/forecast-jobs/build-model) - 1000 observations, 50 variables - 1.6 s response time - 2250 transactions/hour/worker = 4500 transactions per hour for this configuration
- RTInstantML scenario (/forecasting/forecast-jobs/build-model) - 10000 observations, 50 variables - 14 s response time - 257 transactions/hour/worker = 514 transactions per hour for this configuration
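The same arithmetic can be captured in a small sizing helper. The sketch below is a minimal example; the response times come from the benchmark tables above, the function names are illustrative, and the peak load in the last line is an assumed figure.

```python
import math


def transactions_per_hour(response_time_s: float, workers: int = 1) -> int:
    """Throughput of a sequentially executing TIM Worker pool for a given response time."""
    return int(workers * 3600 / response_time_s)


def workers_needed(peak_requests_per_hour: float, response_time_s: float) -> int:
    """Number of TIM Workers required to handle the expected peak load."""
    return math.ceil(peak_requests_per_hour * response_time_s / 3600)


# Example: 1000 observations, 50 variables -> ~1.6 s per build-model request (see benchmark tables).
print(transactions_per_hour(1.6, workers=2))  # ~4500 transactions per hour for two workers
print(workers_needed(9000, 1.6))              # 4 workers for an assumed peak of 9000 requests/hour
```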
Notes:
- Do not forget to cater for data collection. The measurements in the tables above are the processing time (response time) of the TIM Worker.
- The prices are indicative and depend on your plan with Azure.
- The Azure prices are based on a 3-year upfront commitment.
- Similar pricing is possible for AWS or on-premise.
- You might want to consider servers with fewer cores if your case does not benefit from parallelization over multiple cores.
Sizing And Estimation Methodology
Estimating anything can be a complex and error-prone process. That's why it's called an 'estimation' rather than a 'calculation'. There are three primary approaches to sizing a TIM InstantML implementation:
- Algorithm, or Calculation Based
- Size By Example Based
- Proof of Concept Based
Typical implementations of TIM InstantML do not require complex sizing and estimation processes. An algorithm-based approach, taking into account the data size and the number of ML transactions per hour per worker, allows you to determine the number of parallel workers and design your architecture.
In more complex cases, a Proof of Concept might be useful. This is typically the case with more complicated peak-time ML consumption requirements.
Algorithm, or Calculation Based
An algorithm or process that accepts data input is probably the most commonly accepted tool for delivering sizing estimations. Unfortunately, this approach is generally the most inaccurate.
When considering a multiple-model, multiple-use-case implementation, delivering a calculation that even approaches a realistic sizing response requires input values numbering in excess of one hundred, and calculations so complex and sensitive that an input value off by even 1% of the correct value produces wildly inaccurate results.
The other approach to calculation-based solutions is to simplify the calculation to the point where it is simple to understand and simple to use. This paper shows how this kind of simplification can provide us with a sizing calculator.
Size-By-Example Based
A size-by-example (SBE) approach requires a set of known samples to use as data points along the thermometer of system size. The more examples available for SBE, the more accurate the sizing of the intended implementation will be.
By using these real-world examples, both customers and Tangent Works can be assured that the configurations proposed have been implemented before and will provide the performance and functionality unique to the proposed implementation. Tangent Works Engineering can help here.
Proof of Concept Based
A proof of concept (POC), or pilot based approach, offers the most accurate sizing data of all three approaches.
A POC allows you to do the following:
- Test your InstantML implementation design
- Test your chosen hardware or cloud platform
- Simulate projected load
- Validate design assumptions
- Validate Usage
- Provide iterative feedback for your implementation team
- Adjust or validate the implementation decisions made prior to the POC
There are, however, two downsides to a POC based approach, namely time and money. Running a POC requires the customer to have manpower, hardware, and the time available to implement the solution, validate the solution, iterate changes, re-test, and finally analyze the POC findings.
A POC is always the best and recommended approach for any sizing exercise. It delivers results that are accurate for the specific customer's unique implementation and as close to deploying the real live solution as possible, without the capital outlay on hardware and project resources.