Concrete compressive strength¶

Title: Concrete compressive strength
Author: Michal Bezak, Tangent Works
Industry: Construction
Area: Quality assurance
Type: Forecasting

Description¶

Concrete is one of the most widely used building materials. It is a composite of fine and coarse aggregates bonded together with fluid cement that hardens over time into a solid mass.

Although it is typically reinforced with steel, ensuring its strength and durability is critical for the stability of buildings. Because of this, concrete must pass quality control to verify its strength.

Compressive strength is one of the parameters evaluated; in short, it is the capacity to withstand loads. It is measured on a universal testing machine: concrete cubes are prepared from random samples and, after several days of curing, subjected to load testing.

It is known that concrete compressive strength can be modelled as a nonlinear function of age and ingredients. We will demonstrate how to build models with TIM and see whether such models can be used to estimate compressive strength from historical data.

Business parameters¶

Business objective: Reduce operational costs
Business value: Increase efficiency
KPI: -
In [2]:
import logging
import pandas as pd
import plotly as plt
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import json
import datetime

from sklearn.metrics import mean_squared_error, mean_absolute_error
import math

import tim_client
In [3]:
with open('credentials.json') as f:
    credentials_json = json.load(f)                     # loading the credentials from credentials.json

TIM_URL = 'https://timws.tangent.works/v4/api'          # URL to which the requests are sent

SAVE_JSON = False                                       # if True - JSON requests and responses are saved to JSON_SAVING_FOLDER
JSON_SAVING_FOLDER = 'logs/'                            # folder where the requests and responses are stored

LOGGING_LEVEL = 'INFO'
In [4]:
level = logging.getLevelName(LOGGING_LEVEL)
logging.basicConfig(level=level, format='[%(levelname)s] %(asctime)s - %(name)s:%(funcName)s:%(lineno)s - %(message)s')
logger = logging.getLogger(__name__)
In [5]:
credentials = tim_client.Credentials(credentials_json['license_key'], credentials_json['email'], credentials_json['password'], tim_url=TIM_URL)
api_client = tim_client.ApiClient(credentials)

Dataset¶

The dataset contains measurements of key factors from the concrete production process. Our goal is to calculate compressive strength. Outlier values were removed from the original file (more than 40 records were dropped).
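The cleaning step itself is not part of this notebook; purely for illustration, a minimal sketch of such a filter could look as follows (the file name and the IQR rule on the target column are assumptions, not the rule actually used):

# Hypothetical sketch of outlier removal - the actual rule used to clean the original file is not documented here.
# Rows whose target value falls outside 1.5 * IQR of the column are dropped.
raw = pd.read_csv('concrete_data_original.csv')                                    # assumed file name

q1, q3 = raw['Concrete compressive strength'].quantile([0.25, 0.75])
iqr = q3 - q1
cleaned = raw[ raw['Concrete compressive strength'].between(q1 - 1.5*iqr, q3 + 1.5*iqr) ]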

Sampling¶

The original dataset was enhanced with regular timestamps (at a daily sampling rate) so it can be used with TIM. No time-related features are expected to be used (except Age, which has no link to the sequential order of points on the timeline).
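A sketch of how such an artificial daily index could be attached to the cleaned records (the start date chosen here is arbitrary, only the regular daily spacing matters):

# Sketch: add an artificial 'timestamp' column with a daily sampling rate so TIM can treat the rows as a sequence.
cleaned = cleaned.reset_index(drop=True)
cleaned.insert(0, 'timestamp', pd.date_range(start='2015-02-03', periods=len(cleaned), freq='D'))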

Data¶

Structure of the CSV file:

Column name | Description | Type | Availability
timestamp | Timestamp | Timestamp column | -
Concrete compressive strength | MPa | Target | t-1
Cement | kg/m3 | Predictor | t+0
Blast Furnace Slag | kg/m3 | Predictor | t+0
Fly Ash | kg/m3 | Predictor | t+0
Water | kg/m3 | Predictor | t+0
Superplasticizer | kg/m3 | Predictor | t+0
Coarse Aggregate | kg/m3 | Predictor | t+0
Fine Aggregate | kg/m3 | Predictor | t+0
Age | Days | Predictor | t+0

Data situation¶

We want TIM to quantify the current condition based on measurements (actual values); that is, we want to calculate the value of the target based on predictor values only, so the last record of the target must be kept empty (NaN/None) in the dataset. This situation will be replicated by TIM to calculate results for all out-of-sample records.
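A minimal sketch of this preparation step, assuming the dataframe from the sketches above holds the final dataset with the columns listed in the table:

# Sketch: keep the predictor values of the last record, but blank its target
# so TIM calculates ("predicts") the compressive strength for it as well.
df = cleaned.copy()
df.loc[df.index[-1], 'Concrete compressive strength'] = np.nan

df.to_csv('data_concrete_strength_1312.csv', index=False)      # file name used later in this notebook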

CSV files used in experiments can be downloaded here.

Source¶

The original files were acquired from the UCI Machine Learning Repository.

Reuse of the database is unlimited with retention of the copyright notice for Prof. I-Cheng Yeh and the following published paper: I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

(C) Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan 30067, R.O.C., e-mail:icyeh '@' chu.edu.tw, TEL:886-3-5186511

In [6]:
data = tim_client.load_dataset_from_csv_file('data_concrete_strength_1312.csv', sep=',')

data.tail()
Out[6]:
 | timestamp | Concrete compressive strength | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age
976 | 2017-10-06 | 21.86 | 255.0 | 0.0 | 0.0 | 192.0 | 0.0 | 889.8 | 945.0 | 90
977 | 2017-10-07 | 44.87 | 336.5 | 0.0 | 0.0 | 181.9 | 3.4 | 985.8 | 816.8 | 28
978 | 2017-10-08 | 36.84 | 387.0 | 20.0 | 94.0 | 157.0 | 14.3 | 938.0 | 845.0 | 7
979 | 2017-10-09 | 48.15 | 168.9 | 42.2 | 124.3 | 158.3 | 10.8 | 1080.8 | 796.2 | 100
980 | 2017-10-10 | NaN | 323.7 | 282.8 | 0.0 | 183.8 | 10.3 | 942.7 | 659.9 | 56
In [7]:
data.shape
Out[7]:
(981, 10)
In [8]:
target_column = 'Concrete compressive strength'

timestamp_column = 'timestamp'
In [9]:
prediction_horizon = 1

Visualization¶

In [10]:
fig = go.Figure()
   
fig.add_trace(go.Scatter( x = data.index, y = data['Concrete compressive strength'], name='Concrete compressive strength' ) )
fig.add_trace(go.Scatter( x = data.index, y = data['Water'], name='Water' ) )
fig.add_trace(go.Scatter( x = data.index, y = data['Age'], name='Age' ) )

fig.update_layout( height = 700, width = 1200, title='Dataset' )

fig.show()

Engine settings¶

Parameters that need to be set:

  • predictionTo - defines the prediction horizon; it is set to 1 as we frame the problem as an evaluation of the current situation based on the values of predictors only.
  • backtestLength - defines the length of the out-of-sample interval.
  • allowOffsets - set to False to switch off the use of lagged values completely, for both target and predictors; we treat the values as discrete records.
  • features - the list of features/dictionaries used for model building; as there is no time-specific information in the data and offsets are switched off, we can narrow the list down to: "Intercept", "Polynomial", "Identity".

We also ask the engine for additional data to see the details of sub-models, so we define the extendedOutputConfiguration parameter as well.

In [11]:
backtest_length = int( data.shape[0] * .2)

backtest_length
Out[11]:
196
In [12]:
engine_settings = {
    'usage': {                                 
        'predictionTo': { 
            'baseUnit': 'Sample',              
            'offset': 1                 
        },
        'backtestLength': backtest_length     
    },
    'allowOffsets': False,
    'features': ["Intercept" ,"Polynomial", "Identity"],
    'extendedOutputConfiguration': {
        'returnExtendedImportances': True
    }
}

Experiment iteration(s)¶

In [13]:
backtest = api_client.prediction_build_model_predict(data, engine_settings)  

backtest.status   
Out[13]:
'Finished'
In [14]:
backtest.result_explanations
Out[14]:
[]

Insights - inspecting ML models¶

Simple and extended importances are available, showing to what extent each predictor contributes to explaining the variance of the target variable.

In [15]:
simple_importances = pd.DataFrame.from_dict( backtest.predictors_importances['simpleImportances'], orient='columns' )

simple_importances
Out[15]:
 | importance | predictorName
0 | 34.70 | Cement
1 | 27.87 | Age
2 | 19.90 | Blast Furnace Slag
3 | 10.13 | Superplasticizer
4 | 3.90 | Water
5 | 3.50 | Fly Ash
In [16]:
fig = go.Figure()

fig.add_trace(go.Bar( x = simple_importances['predictorName'],
                      y = simple_importances['importance'] ) )

fig.update_layout( title='Simple importances')

fig.show()
In [17]:
extended_importances_temp = backtest.predictors_importances['extendedImportances']
extended_importances_temp = sorted( extended_importances_temp, key = lambda i: i['importance'], reverse=True ) 
extended_importances = pd.DataFrame.from_dict( extended_importances_temp )

extended_importances
Out[17]:
 | time | type | termName | importance
0 | [1] | Predictor | Cement | 28.40
1 | [1] | Interaction | Age & Blast Furnace Slag | 23.67
2 | [1] | Interaction | Age & Superplasticizer | 20.26
3 | [1] | Predictor | Blast Furnace Slag | 8.07
4 | [1] | Interaction | Cement & Water | 7.79
5 | [1] | Interaction | Age & Fly Ash | 7.00
6 | [1] | Interaction | Cement & Age | 4.82
7 | [1] | TargetAndTargetTransformation | Intercept | 0.00
In [18]:
fig = go.Figure()

fig.add_trace(go.Bar( x = extended_importances[ extended_importances['time'] == '[1]' ]['termName'],
                      y = extended_importances[ extended_importances['time'] == '[1]' ]['importance'] ) )

fig.update_layout(
        title='Features generated from predictors used by model',
        height = 700
)

fig.show()

Evaluation of results¶

Results for the out-of-sample interval.

In [19]:
def build_evaluation_data( backtest, data ):
    # out-of-sample predictions are returned in aggregated_predictions[1]
    out_of_sample_predictions = backtest.aggregated_predictions[1]['values']
    out_of_sample_predictions.rename( columns = {'Prediction':target_column+'_pred'}, inplace=True)
    out_of_sample_timestamps = out_of_sample_predictions.index.tolist()

    evaluation_data = data.copy()

    # align timezone-aware timestamps and keep only the out-of-sample records
    evaluation_data[ timestamp_column ] = pd.to_datetime(data[ timestamp_column ]).dt.tz_localize('UTC')
    evaluation_data = evaluation_data[ evaluation_data[ timestamp_column ].isin( out_of_sample_timestamps ) ]

    evaluation_data.set_index( timestamp_column,inplace=True)
    evaluation_data = evaluation_data[ [ target_column ] ]

    # join actual values with the corresponding predictions
    evaluation_data = evaluation_data.join( out_of_sample_predictions )

    return evaluation_data
In [20]:
def plot_results( e ):
    fig = go.Figure()
   
    fig.add_trace(go.Scatter( x = e.index, y = e.iloc[:,1], name=e.columns[1] ) )
    fig.add_trace(go.Scatter( x = e.index, y = e.iloc[:,0], name=e.columns[0] ) )

    fig.update_layout( height = 700, width = 1200, title='Actual vs. predicted' )

    fig.show()
In [21]:
backtest.aggregated_predictions[1]['accuracyMetrics']
Out[21]:
{'MAE': 6.549382750637635,
 'MSE': 69.52513585340209,
 'MAPE': 26.306591973733735,
 'RMSE': 8.338173412288935}
In [22]:
e = build_evaluation_data( backtest, data ).reset_index(drop=True)

plot_results(e)
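The sklearn metrics imported at the beginning of the notebook can be used to cross-check the engine's accuracy metrics directly on the evaluation frame e (the first column holds actual values, the second column the predictions):

actual, predicted = e.iloc[:, 0], e.iloc[:, 1]

mae = mean_absolute_error(actual, predicted)
rmse = math.sqrt(mean_squared_error(actual, predicted))
mape = ((actual - predicted).abs() / actual).mean() * 100      # simple MAPE in %

print(f'MAE: {mae:.2f}  RMSE: {rmse:.2f}  MAPE: {mape:.1f}%')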

Summary¶

We demonstrated how to build models with TIM even for data that are not, by definition, time series. Our dataset is rather short; for the out-of-sample interval, the MAPE metric shows 26%.(1)

Also, to evaluate results for this class of use cases, it is recommended to check how often the prediction goes in the unfavorable direction, i.e. the predicted value is higher than the actual value.
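For instance, a quick check of how often the model over-predicts on the out-of-sample interval could look like this (reusing the evaluation frame e from above):

# Share of out-of-sample records where the predicted strength exceeds the actual strength
over_predicted = e.iloc[:, 1] > e.iloc[:, 0]                   # column 0: actual, column 1: prediction

print(f'Over-predicted: {over_predicted.sum()} of {len(e)} records ({over_predicted.mean() * 100:.1f}%)')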

Residuals for some samples, e.g. at index 34 or 96, are unusually big, hence the question: could those be anomalies/issues in the data, or is there a confounding variable (another factor that may influence the target) that is not present in the dataset?
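A sketch of how the records with the largest residuals can be listed for closer inspection:

# Rank out-of-sample records by absolute residual to spot suspicious samples
residuals = (e.iloc[:, 0] - e.iloc[:, 1]).abs()

print(residuals.sort_values(ascending=False).head(10))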

With this dataset, it is possible to frame the problem as anomaly detection or classification as well. TIM has built-in capability to solve both types of tasks.


(1) This result is representative; 30 other iterations were run with rows shuffled randomly, and the MAPE metric stayed in the range of 22.7% - 33.9%.