Concrete is one of the most widely used building materials. It is a composite of fine and coarse aggregates bonded together with a fluid cement that hardens over time into a solid mass.
Although concrete is typically reinforced with steel, its own strength and durability remain critical for the stability of buildings. For this reason, concrete must pass quality control checks that verify its strength.
Compressive strength is one of the parameters evaluated. In short, it is the capacity of the material to withstand loads, and it is measured on a universal testing machine: concrete cubes are prepared from random samples and, after curing for several days, are subjected to load testing.
It is known that concrete compressive strength can be modelled as a nonlinear function of age and ingredient quantities. We will demonstrate how to build such models with TIM and see whether they can be used to estimate compressive strength from historical data.
Business objective: | Reduce operational costs
---|---
Business value: | Increase efficiency
KPI: | -
import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json
import datetime
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math
import tim_client
with open('credentials.json') as f:
credentials_json = json.load(f) # loading the credentials from credentials.json
TIM_URL = 'https://timws.tangent.works/v4/api' # URL to which the requests are sent
SAVE_JSON = False # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/' # folder where the requests and responses are stored
LOGGING_LEVEL = 'INFO'
level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)
credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)
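For reference, the credentials.json file loaded above is assumed to contain just the three keys passed to tim_client.Credentials (placeholder values shown, not real credentials):
{
    "license_key": "YOUR_LICENSE_KEY",
    "email": "user@example.com",
    "password": "YOUR_PASSWORD"
}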
The dataset contains measurements of key factors from the concrete production process. Our goal is to calculate compressive strength. Outlier values were removed from the original file (more than 40 records were dropped).
The original dataset was enhanced with regular timestamps (daily sampling rate) so that it can be used with TIM. No time-related features are expected to be used (except age, which has no link to the sequential order of points on the timeline).
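A minimal sketch of how such an artificial daily timeline can be attached to the raw table, using the pandas import from the top of the notebook (file names and start date are illustrative assumptions; the published CSV already contains the timestamp column):
raw = pd.read_csv('concrete_data_raw.csv')  # hypothetical raw export without timestamps
raw.insert(0, 'timestamp', pd.date_range('2015-02-03', periods=len(raw), freq='D'))  # artificial daily sampling
raw.to_csv('concrete_data_with_timestamps.csv', index=False)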
Structure of CSV file:
Column name | Unit / description | Type | Availability |
---|---|---|---|
timestamp | Timestamp | Timestamp column | |
Concrete compressive strength | MPa | Target | t-1 |
Cement | kg/m3 | Predictor | t+0 |
Blast Furnace Slag | kg/m3 | Predictor | t+0 |
Fly Ash | kg/m3 | Predictor | t+0 |
Water | kg/m3 | Predictor | t+0 |
Superplasticizer | kg/m3 | Predictor | t+0 |
Coarse Aggregate | kg/m3 | Predictor | t+0 |
Fine Aggregate | kg/m3 | Predictor | t+0 |
Age | Days | Predictor | t+0 |
We want TIM to quantify the current condition based on measurements (actual values); that is, we want to calculate the value of the target from predictor values only, so the last record of the target must be left empty (NaN/None) in the dataset. TIM replicates this situation to calculate results for all out-of-sample records.
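A minimal sketch of preparing a dataset this way, assuming the column names from the table above (the published CSV already has its last target value empty):
df = pd.read_csv('data_concrete_strength_1312.csv')
df.loc[df.index[-1], 'Concrete compressive strength'] = np.nan  # last target record left empty for TIM to calculate
df.to_csv('data_concrete_strength_1312.csv', index=False)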
CSV files used in experiments can be downloaded here.
Original files were acquired from the UCI Machine Learning Repository.
Reuse of database is unlimited with retention of copyright notice for Prof. I-Cheng Yeh and the following published paper: I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).
(C) Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan 30067, R.O.C., e-mail:icyeh '@' chu.edu.tw, TEL:886-3-5186511
data = tim_client.load_dataset_from_csv_file('data_concrete_strength_1312.csv', sep=',')
data.tail()
timestamp | Concrete compressive strength | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | |
---|---|---|---|---|---|---|---|---|---|---|
976 | 2017-10-06 | 21.86 | 255.0 | 0.0 | 0.0 | 192.0 | 0.0 | 889.8 | 945.0 | 90 |
977 | 2017-10-07 | 44.87 | 336.5 | 0.0 | 0.0 | 181.9 | 3.4 | 985.8 | 816.8 | 28 |
978 | 2017-10-08 | 36.84 | 387.0 | 20.0 | 94.0 | 157.0 | 14.3 | 938.0 | 845.0 | 7 |
979 | 2017-10-09 | 48.15 | 168.9 | 42.2 | 124.3 | 158.3 | 10.8 | 1080.8 | 796.2 | 100 |
980 | 2017-10-10 | NaN | 323.7 | 282.8 | 0.0 | 183.8 | 10.3 | 942.7 | 659.9 | 56 |
data.shape
(981, 10)
target_column = 'Concrete compressive strength'
timestamp_column = 'timestamp'
prediction_horizon = 1
fig = go.Figure()
fig.add_trace(go.Scatter( x = data.index, y = data['Concrete compressive strength'], name='Concrete compressive strength' ) )
fig.add_trace(go.Scatter( x = data.index, y = data['Water'], name='Water' ) )
fig.add_trace(go.Scatter( x = data.index, y = data['Age'], name='Age' ) )
fig.update_layout( height = 700, width = 1200, title='Dataset' )
fig.show()
Parameters that need to be set: the prediction horizon (predictionTo), the length of the backtesting interval (backtestLength), whether offsets of predictors may be used (allowOffsets), and the feature types the engine may combine (features).
We also ask the engine for additional data to see details of sub-models, so we define the extendedOutputConfiguration parameter as well.
backtest_length = int( data.shape[0] * .2)
backtest_length
196
engine_settings = {
'usage': {
'predictionTo': {
'baseUnit': 'Sample',
'offset': 1
},
'backtestLength': backtest_length
},
'allowOffsets': False,
'features': ["Intercept" ,"Polynomial", "Identity"],
'extendedOutputConfiguration': {
'returnExtendedImportances': True
}
}
backtest = api_client.prediction_build_model_predict(data, engine_settings)
backtest.status
'Finished'
backtest.result_explanations
[]
Simple and extended importances are available so you can see to what extent each predictor contributes to explaining the variance of the target variable.
simple_importances = pd.DataFrame.from_dict( backtest.predictors_importances['simpleImportances'], orient='columns' )
simple_importances
importance | predictorName | |
---|---|---|
0 | 34.70 | Cement |
1 | 27.87 | Age |
2 | 19.90 | Blast Furnace Slag |
3 | 10.13 | Superplasticizer |
4 | 3.90 | Water |
5 | 3.50 | Fly Ash |
fig = go.Figure()
fig.add_trace(go.Bar( x = simple_importances['predictorName'],
y = simple_importances['importance'] ) )
fig.update_layout( title='Simple importances')
fig.show()
extended_importances_temp = backtest.predictors_importances['extendedImportances']
extended_importances_temp = sorted( extended_importances_temp, key = lambda i: i['importance'], reverse=True )
extended_importances = pd.DataFrame.from_dict( extended_importances_temp )
extended_importances
time | type | termName | importance | |
---|---|---|---|---|
0 | [1] | Predictor | Cement | 28.40 |
1 | [1] | Interaction | Age & Blast Furnace Slag | 23.67 |
2 | [1] | Interaction | Age & Superplasticizer | 20.26 |
3 | [1] | Predictor | Blast Furnace Slag | 8.07 |
4 | [1] | Interaction | Cement & Water | 7.79 |
5 | [1] | Interaction | Age & Fly Ash | 7.00 |
6 | [1] | Interaction | Cement & Age | 4.82 |
7 | [1] | TargetAndTargetTransformation | Intercept | 0.00 |
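Extended importances are reported per prediction time (here only '[1]', i.e. one sample ahead) and, like the simple importances, are percentages that sum to roughly 100% within each time step; a quick check on the dataframe above:
extended_importances.groupby('time')['importance'].sum()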
fig = go.Figure()
fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '[1]' ]['termName'],
y = extended_importances[ extended_importances['time'] == '[1]' ]['importance'] ) )
fig.update_layout(
title='Features generated from predictors used by model',
height = 700
)
fig.show()
Results for out-of-sample interval.
def build_evaluation_data( backtest, data ):
out_of_sample_predictions = backtest.aggregated_predictions[1]['values']
out_of_sample_predictions.rename( columns = {'Prediction':target_column+'_pred'}, inplace=True)
out_of_sample_timestamps = out_of_sample_predictions.index.tolist()
evaluation_data = data.copy()
evaluation_data[ timestamp_column ] = pd.to_datetime(data[ timestamp_column ]).dt.tz_localize('UTC')
evaluation_data = evaluation_data[ evaluation_data[ timestamp_column ].isin( out_of_sample_timestamps ) ]
evaluation_data.set_index( timestamp_column,inplace=True)
evaluation_data = evaluation_data[ [ target_column ] ]
evaluation_data = evaluation_data.join( out_of_sample_predictions )
return evaluation_data
def plot_results( e ):
fig = go.Figure()
fig.add_trace(go.Scatter( x = e.index, y = e.iloc[:,1], name=e.columns[1] ) )
fig.add_trace(go.Scatter( x = e.index, y = e.iloc[:,0], name=e.columns[0] ) )
fig.update_layout( height = 700, width = 1200, title='Actual vs. predicted' )
fig.show()
backtest.aggregated_predictions[1]['accuracyMetrics']
{'MAE': 6.549382750637635, 'MSE': 69.52513585340209, 'MAPE': 26.306591973733735, 'RMSE': 8.338173412288935}
e = build_evaluation_data( backtest, data ).reset_index(drop=True)
plot_results(e)
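As a cross-check, the same accuracy metrics can be recomputed locally from the evaluation frame (a sketch using the sklearn, numpy and math imports from the top of the notebook and the two columns produced by build_evaluation_data):
actual = e[target_column]
predicted = e[target_column + '_pred']
mask = actual.notna() & predicted.notna()  # guard against missing values

mae = mean_absolute_error(actual[mask], predicted[mask])
rmse = math.sqrt(mean_squared_error(actual[mask], predicted[mask]))
mape = float(np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100)
print(f'MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.1f}%')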
We demonstrated how to build models with TIM even for data that are not, by definition, time series. Our dataset is rather short; for the out-of-sample interval the MAPE metric is about 26%.(1)
For this class of use cases it is also recommended to evaluate how often the prediction errs in the unfavorable direction, i.e. how often the predicted value is higher than the actual value.
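A minimal sketch of such a check on the out-of-sample records:
overestimated = (e[target_column + '_pred'] > e[target_column]).mean()  # share of records predicted too high
print(f'Prediction exceeds actual strength in {overestimated:.1%} of out-of-sample records')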
Residuals for some samples, e.g. at index 34 or 96, are unusually large, which raises the question of whether those records are anomalies/issues in the data or whether a confounding variable (another factor influencing the target) is missing from the dataset.
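For instance, the records with the largest absolute residuals can be listed directly from the evaluation frame:
residuals = (e[target_column] - e[target_column + '_pred']).abs()
residuals.sort_values(ascending=False).head(10)  # out-of-sample records with the largest errors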
With this dataset, it is possible to frame the problem as anomaly detection or classification as well. TIM has built-in capability to solve both types of tasks.
(1) This result is representative: in 30 additional iterations with rows shuffled randomly, the MAPE metric ranged from 22.7% to 33.9%.
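A sketch of how such a shuffling experiment could be set up (this is an assumed procedure, not the exact code behind footnote (1)): complete rows are reordered while the regular timeline and the empty last target record are preserved, and the backtest is repeated.
mapes = []
for seed in range(30):
    body = data.iloc[:-1].sample(frac=1, random_state=seed).reset_index(drop=True)  # shuffle all complete rows
    shuffled = pd.concat([body, data.iloc[-1:]], ignore_index=True)  # keep the NaN-target row last
    shuffled[timestamp_column] = data[timestamp_column].values  # keep the regular daily timeline
    bt = api_client.prediction_build_model_predict(shuffled, engine_settings)
    mapes.append(bt.aggregated_predictions[1]['accuracyMetrics']['MAPE'])

print(min(mapes), max(mapes))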