The variable importance measure (VIM) can be used to rank or select important variables, which effectively reduces the variable dimension and shortens the computational time. Random forest (RF) is an ensemble learning method that constructs multiple decision trees. To improve the prediction accuracy of random forest, an advanced random forest is presented in which Kriging models serve as the models of the leaf nodes in all the decision trees. Referring to the Mean Decrease Accuracy (MDA) index based on Out-of-Bag (OOB) data, importance measures for single variables, group variables and correlated variables are proposed to establish a complete VIM system on the basis of the advanced random forest. The link between MDA and the variance-based total sensitivity index is explored, and the corresponding relationship between the proposed VIM indices and the variance-based global sensitivity indices is constructed, which provides a novel way to compute variance-based global sensitivity. Finally, several numerical and engineering examples are given to verify the effectiveness of the proposed VIM system and the validity of the established relationship.
Keywords: variable importance measure; random forest; variance-based global sensitivity; Kriging model

Introduction
Sensitivity analysis reflects the influence of input variables on the output response. It includes local sensitivity and global sensitivity analysis [1]. Local sensitivity reflects the influence of input variables on the characteristics of the output at their nominal values. Global sensitivity analysis, also known as importance measure analysis, estimates the influence of input variables over their whole distribution region on the characteristics of the output [2–4]. There are three kinds of importance measures: non-parametric measures, variance-based global sensitivity and moment-independent importance measures [1]. Variance-based global sensitivity is the most widely applied measure because it is general and holistic: it can give the contribution of group variables and the cross influence of different variables. There are plenty of methods to calculate variance-based global sensitivity indices, such as Monte Carlo (MC) simulation [5], high dimensional model representation (HDMR) [6], the state-dependent parameter (SDP) procedure [7], and so on. MC simulation can estimate the approximate exact solution of the total and main sensitivity indices simultaneously, but the amount of computation is generally large, especially for high dimensional engineering problems. HDMR and SDP can calculate the main sensitivity indices by solving all order components of input-output surrogate models.
Random forest (RF), composed of multiple decision trees (DTs), is an ensemble learning method proposed by Breiman [8]. RF has many advantages, such as strong robustness and good tolerance to outliers and noise, and it has a wide range of applications, such as geographical energy [9], the chemical industry [10], health insurance [11] and data science competitions. RF can not only deal with classification and regression problems but can also provide importance measures. RF provides two kinds of importance measures: Mean Decrease Impurity (MDI) based on the Gini index and Mean Decrease Accuracy (MDA) based on Out-of-Bag (OOB) data [12]. The MDI index is the average reduction of Gini impurity due to a splitting variable in the decision trees across the RF [13]. The MDI index is sensitive to variables with different scales of measurement and shows artificial inflation for variables with many categories. For correlated variables, the MDI index depends on the selection sequence of the variables. Once one of them is selected, the impurity is reduced by that first selected variable; it is then difficult for the other correlated variables to reduce the impurity by the same magnitude, so their apparent importance declines. The MDA index is the average reduction of prediction accuracy after randomly permuting the OOB data [14,15]. Since the MDA index measures the impact of each variable on the prediction accuracy of the RF model and is free of these biases, it has been widely used in many scientific areas. Although RF-based importance measures exist to distinguish important features, there is no complete importance measure system that handles both nonlinearity and correlation among variables [16,17]. In addition, the similarity between the MDA procedure based on OOB data and the Monte Carlo simulation of variance-based global sensitivity can be used as a breakthrough point to find their link [18].
With the help of variance-based sensitivity index system, the construction of variable importance measure system based on RF can be realized.
By comparing the procedure of estimating the total sensitivity indices and the MDA index based on OOB data, a complete VIM system is established based on advanced RF by using Kriging models, including single variable, group variables and correlated variables importance measure indices. The proposed VIM system combines the advantages of random forest and Kriging model. The VIM system can indicate the contribution of input variables to output response and rank important variables, and also give a novel way to solve variance-based global sensitivity with small samples.
This paper is organized as follows: Section 2 reviews the basic concept of variance-based global sensitivity. Section 3 first reviews random forest, presents the MDA index, and then proposes the single variable, group variables and correlated variables importance measures. Section 4 establishes the link between the MDA index and the variance-based total sensitivity index, and derives the relationship between the VIM indices and the variance-based global sensitivity indices. Section 5 provides several numerical and engineering examples, followed by the conclusions in Section 6.
Variance-Based Global Sensitivity
The variance-based global sensitivity, proposed by Sobol [19], reflects the influence of input variables in the whole distribution region on the variance of model output. The variance-based global sensitivity indices not only have strong model generality, but also can discuss the importance of group variables and quantify the interaction between input variables. ANOVA (Analysis of Variance) decomposition is the basic of variance-based global sensitivity analysis.
ANOVA Decomposition
The response function $Y = g(\mathbf{X})$ admits a unique ANOVA decomposition:

$$g(\mathbf{X}) = g_0 + \sum_{i=1}^{n} g_i(X_i) + \sum_{1 \le i < j \le n} g_{ij}(X_i, X_j) + \cdots + g_{1,2,\ldots,n}(X_1, \ldots, X_n) \tag{1}$$
where $n$ is the dimension of the input variables, $g_0$ is the expectation of $g(\mathbf{X})$, $g_0 = \int_{\mathbb{R}^n} g(\mathbf{x}) \prod_{i=1}^{n} f_{X_i}(x_i)\,dx_i$, and $f_{X_i}(x_i)$ is the probability density function of variable $X_i$. The components in Eq. (1) are:
$$g_i(X_i) = \int_{\mathbb{R}^{n-1}} g(\mathbf{x}) \prod_{j \ne i} f_{X_j}(x_j)\,dx_j - g_0$$
$$g_{ij}(X_i, X_j) = \int_{\mathbb{R}^{n-2}} g(\mathbf{x}) \prod_{k \ne i,j} f_{X_k}(x_k)\,dx_k - g_i(X_i) - g_j(X_j) - g_0$$
Variance-Based Global Sensitivity Indices
The variance of response function can be expressed as:
$$V = \mathrm{Var}(Y) = \int_{\mathbb{R}^n} g^2(\mathbf{x}) \prod_{i=1}^{n} f_{X_i}(x_i)\,dx_i - g_0^2$$
Since the decomposition terms are orthogonal, the variance of the response function is the sum of variances of all individual decomposition terms:
$$V = \sum_{i=1}^{n} V_i + \sum_{1 \le i < j \le n} V_{ij} + \cdots + V_{1,2,\ldots,n}$$
where $V_i = \mathrm{Var}[g_i(X_i)]$, $V_{ij} = \mathrm{Var}[g_{ij}(X_i, X_j)]$, and so on for the higher-order terms.
Then the ratio of each variance component to the variance of the response function reflects the variance contribution of each component, i.e., $S_i = V_i/V$, $S_{ij} = V_{ij}/V$, $\ldots$
$S_i = V_i/V$ is the first order sensitivity index of variable $X_i$ (also called the main sensitivity index); it reflects the individual influence of $X_i$ on the response $Y$. $S_{ij} = V_{ij}/V$ is the second order sensitivity index; it reflects the interaction influence of variables $X_i$ and $X_j$ on the response $Y$. The total sensitivity index $S_i^T$ is obtained by summing all the indices involving variable $X_i$:
$$S_i^T = S_i + \sum_{j \ne i} S_{ij} + \sum_{\substack{j,k \ne i \\ j < k}} S_{ijk} + \cdots + S_{12 \ldots n}$$
According to probability theory, the variance-based global sensitivity indices can be expressed as [20]:
$$S_i = \frac{\mathrm{Var}[E(Y \mid X_i)]}{\mathrm{Var}(Y)}, \quad S_{ij} = \frac{\mathrm{Var}[E(Y \mid X_i, X_j)]}{\mathrm{Var}(Y)}, \quad S_i^T = \frac{\mathrm{Var}(Y) - \mathrm{Var}[E(Y \mid \mathbf{X}_{\sim i})]}{\mathrm{Var}(Y)} = 1 - \frac{\mathrm{Var}[E(Y \mid \mathbf{X}_{\sim i})]}{\mathrm{Var}(Y)}$$
where $\mathbf{X}_{\sim i}$ denotes the variable vector without $X_i$.
Simulation of Variance-Based Global Sensitivity Indices
Due to the enormous computational load, the traditional double-loop Monte Carlo simulation is not suitable for complex engineering problems [21]. The computational procedures of single-loop Monte Carlo simulation are listed as follows:
Step 1: Randomly generate two sample matrices A and B based on the probability distribution of variables X.
$$\mathbf{A} = \begin{bmatrix} x_1^{(1)} & \cdots & x_i^{(1)} & \cdots & x_n^{(1)} \\ \vdots & & \vdots & & \vdots \\ x_1^{(N)} & \cdots & x_i^{(N)} & \cdots & x_n^{(N)} \end{bmatrix}_{N \times n}, \quad \mathbf{B} = \begin{bmatrix} x_1^{(N+1)} & \cdots & x_i^{(N+1)} & \cdots & x_n^{(N+1)} \\ \vdots & & \vdots & & \vdots \\ x_1^{(2N)} & \cdots & x_i^{(2N)} & \cdots & x_n^{(2N)} \end{bmatrix}_{N \times n}$$
Step 2: Construct sample matrix Ci, where the ith column of Ci comes from the ith column of A, and the other columns come from the corresponding columns of B.
$$\mathbf{C}_i = \begin{bmatrix} x_1^{(N+1)} & \cdots & x_i^{(1)} & \cdots & x_n^{(N+1)} \\ \vdots & & \vdots & & \vdots \\ x_1^{(2N)} & \cdots & x_i^{(N)} & \cdots & x_n^{(2N)} \end{bmatrix}_{N \times n}$$
Step 3: The main and total sensitivity indices can be expressed as follows:
$$S_i = \frac{\frac{1}{N}\sum_{j=1}^{N} y_A^{(j)} y_{C_i}^{(j)} - g_0^2}{\mathrm{Var}(Y)}$$
$$S_i^T = 1 - \frac{\frac{1}{N}\sum_{j=1}^{N} y_B^{(j)} y_{C_i}^{(j)} - g_0^2}{\mathrm{Var}(Y)}$$
where $\mathbf{y}_A = [y_A^{(1)}, \ldots, y_A^{(N)}]$, $\mathbf{y}_B = [y_B^{(1)}, \ldots, y_B^{(N)}]$ and $\mathbf{y}_{C_i} = [y_{C_i}^{(1)}, \ldots, y_{C_i}^{(N)}]$ are the model outputs for the input matrices $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}_i$ respectively. The computational cost of the single-loop Monte Carlo simulation is $(n+2) \times N$ model evaluations.
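The three steps above can be sketched in a few lines of numpy. Here `g` is assumed to be a vectorized model and `sampler` a routine drawing an (N, n) input sample; both names are placeholders introduced for this sketch:

```python
import numpy as np

def sobol_indices(g, sampler, N, rng):
    """Single-loop MC estimator of the main (S_i) and total (S_i^T)
    indices, following the A/B/C_i matrix scheme of Steps 1-3."""
    A = sampler(N, rng)            # Step 1: two independent sample matrices
    B = sampler(N, rng)
    yA, yB = g(A), g(B)
    g0 = yA.mean()                 # estimate of E[Y]
    V = yA.var()                   # estimate of Var(Y)
    n = A.shape[1]
    S, ST = np.empty(n), np.empty(n)
    for i in range(n):
        C = B.copy()               # Step 2: C_i = B with column i from A
        C[:, i] = A[:, i]
        yC = g(C)
        S[i] = (np.mean(yA * yC) - g0**2) / V         # Step 3, main index
        ST[i] = 1.0 - (np.mean(yB * yC) - g0**2) / V  # Step 3, total index
    return S, ST
```

As a sanity check, for the additive model $Y = X_1 + 2X_2$ with independent standard normal inputs, the exact values are $S_1 = S_1^T = 0.2$ and $S_2 = S_2^T = 0.8$; the total cost is indeed $(n+2)N$ evaluations of `g`.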
Variable Importance Measure System Based on Random Forest
RF is an ensemble statistical learning method for classification and regression problems [22]. The Bootstrap sampling technique is first carried out to extract training samples from the original data, and these training samples are used to build a decision tree; the remaining Out-of-Bag (OOB) data are used to verify the accuracy of the established decision tree.
Random Forest
M decision trees are established by employing the Bootstrap sampling technique M times, and together they compose a random forest (shown in Fig. 1). The final prediction of the RF is obtained by voting in the classification case or by averaging in the regression case [23]. The prediction precision of the RF can be expressed by the mean square error (MSE) between the predicted values and the true values of the OOB data.
The Bootstrap technique extracts training points to build each decision tree $h_m$ ($m = 1, 2, \ldots, M$), leaving the corresponding OOB data with input $\mathbf{X}_{OOB}$ and output $\mathbf{y}$. The decision tree $h_m$ predicts the response $\mathbf{y}_m$ of $\mathbf{X}_{OOB}$, and its MSE is $\varepsilon_m = \mathrm{mean}(\mathbf{y}_m - \mathbf{y})^2$. After obtaining the MSEs of all decision trees, $\varepsilon_m$ ($m = 1, 2, \ldots, M$), their average is the total prediction error of the RF model [24]:
$$\mathrm{MSE} = \frac{1}{M}\sum_{m=1}^{M} \varepsilon_m$$
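The bootstrap/OOB bookkeeping behind this average can be sketched as follows. A depth-1 regression stump stands in for the full decision trees (and for the Kriging leaf models), so this is an illustrative skeleton rather than the authors' implementation:

```python
import numpy as np

def fit_stump(x, y):
    """Depth-1 regression tree: choose the split threshold minimising SSE."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_sse, thr, mL, mR = np.inf, xs[0], ys.mean(), ys.mean()
    for k in range(1, len(xs)):
        left, right = ys[:k], ys[k:]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best_sse:
            best_sse, thr, mL, mR = sse, xs[k - 1], left.mean(), right.mean()
    return lambda q, t=thr, a=mL, b=mR: np.where(q <= t, a, b)

def forest_oob_mse(x, y, M, rng):
    """MSE = (1/M) * sum(eps_m): average OOB error over M bootstrap trees."""
    N, eps = len(x), []
    for _ in range(M):
        idx = rng.integers(0, N, N)              # bootstrap training sample
        oob = np.setdiff1d(np.arange(N), idx)    # out-of-bag points
        if oob.size:
            tree = fit_stump(x[idx], y[idx])
            eps.append(np.mean((tree(x[oob]) - y[oob])**2))  # eps_m
    return float(np.mean(eps))
```

On any dataset with real signal, the averaged OOB error should fall well below the raw output variance, which is what the OOB estimate is meant to verify.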
In order to improve the prediction precision of RF, a high-precision Kriging model is used as the model of the leaf nodes in the decision tree, replacing the original average or linear regression. A nonlinear discontinuous function is used to compare the prediction accuracy of the Kriging-based and linear-regression-based decision trees:
$$Y = \begin{cases} -X^2 + 10\cos(2\pi X) - 30, & X < 0 \\ X^2 - 10\cos(2\pi X) + 30, & X \ge 0 \end{cases}$$
where the input variable X is uniformly distributed on [-π, π].
A comparison of the Kriging based decision tree (abbreviated as Kriging-DT) and the linear regression based decision tree (abbreviated as Linear-DT) on prediction data is shown in Fig. 2. The predicted errors of Kriging-DT and Linear-DT as the number of training samples increases are shown in Fig. 3. It can be seen that Kriging-DT approximates the original function better: for the same training samples, Kriging-DT has higher prediction accuracy and a faster decline rate of the predicted error than Linear-DT. Kriging-DT inherits the advantages of the Kriging model and is well suited to nonlinear piecewise functions.
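The flavour of this comparison can be reproduced with a minimal zero-mean Gaussian-process (Kriging-type) interpolator in numpy. The RBF kernel, length-scale, nugget and the least-squares baseline below are illustrative assumptions, not the paper's exact Kriging-DT/Linear-DT models:

```python
import numpy as np

def f(x):
    """Nonlinear discontinuous test function used in the comparison."""
    return np.where(x < 0,
                    -x**2 + 10 * np.cos(2 * np.pi * x) - 30,
                     x**2 - 10 * np.cos(2 * np.pi * x) + 30)

def gp_fit(xtr, ytr, ls=0.3, nugget=1e-6):
    """Minimal Kriging-type interpolator: zero-mean GP with an RBF kernel."""
    K = np.exp(-0.5 * (xtr[:, None] - xtr[None, :])**2 / ls**2)
    alpha = np.linalg.solve(K + nugget * np.eye(len(xtr)), ytr)
    return lambda q: np.exp(-0.5 * (q[:, None] - xtr[None, :])**2 / ls**2) @ alpha

rng = np.random.default_rng(0)
xtr = np.sort(rng.uniform(-np.pi, np.pi, 64))     # 64 training samples
ytr = f(xtr)
xte = np.linspace(-np.pi, np.pi, 500)
yte = f(xte)

gp = gp_fit(xtr, ytr)
coef = np.polyfit(xtr, ytr, 1)                    # linear-regression baseline
mse_gp = np.mean((gp(xte) - yte)**2)
mse_lin = np.mean((np.polyval(coef, xte) - yte)**2)
```

With 64 training points the GP-type interpolant tracks both the oscillation and the jump far better than a single global line, mirroring the Fig. 2/Fig. 3 comparison.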
Comparison of Kriging-DT, Linear-DT and prediction data with 64 training samples
Predicted errors of Kriging-DT and Linear-DT vs. size of training samples
There are two kinds of importance measures based on RF: Mean Decrease Impurity (MDI) based on Gini index and Mean Decrease Accuracy (MDA) based on OOB data. MDA index is widely used to rank important variables on the prediction accuracy of RF model [12].
Mean Decrease Accuracy Index of Random Forest
The MDA index is the average reduction of prediction accuracy after randomly permuting the OOB data. Permuting the values of a variable in the OOB data destroys the association between that variable and the output. The prediction accuracy is recalculated after each permutation, and the MSE between the paired predictions is taken as the importance measure.
For the decision tree $h_m$ ($m = 1, 2, \ldots, M$), the corresponding OOB input data is the matrix $\mathbf{X}_{OOB} = (\mathbf{X}_{OOB}^1, \ldots, \mathbf{X}_{OOB}^i, \ldots, \mathbf{X}_{OOB}^n)$, where $\mathbf{X}_{OOB}^i$ is the $i$th column of $\mathbf{X}_{OOB}$. After permuting the order of $\mathbf{X}_{OOB}^i$, the decision tree $h_m$ produces the new forecast response $\mathbf{y}_m^i$, and the MSE of the predicted values is $\varepsilon_m^i = \mathrm{mean}(\mathbf{y}_m^i - \mathbf{y}_m)^2$. Collecting the influence of variable $X_i$ over all decision trees, $(\varepsilon_1^i, \varepsilon_2^i, \ldots, \varepsilon_M^i)$, the average of $\varepsilon_m^i$ ($m = 1, 2, \ldots, M$) is the total impact of variable $X_i$ based on the RF model:
$$\eta_i^T = \frac{1}{M}\sum_{m=1}^{M} \varepsilon_m^i$$
The subscript $m$ of $\varepsilon_m^i$ and $\mathbf{y}_m^i$ is the index of the decision tree $h_m$ ($m = 1, 2, \ldots, M$), and the superscript $i$ indicates that the $i$th column of $\mathbf{X}_{OOB}$ has been permuted, corresponding to variable $X_i$.
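The computation of $\varepsilon_m^i$ and $\eta_i^T$ can be sketched as follows, with a generic fitted `predict` callable standing in for the decision tree $h_m$ (the names here are placeholders for this sketch):

```python
import numpy as np

def eps_total(predict, X_oob, i, rng):
    """eps_m^i: mean squared change in one tree's OOB predictions after
    permuting column i, which destroys the X_i <-> output association."""
    y_ref = predict(X_oob)
    Xp = X_oob.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])
    return np.mean((predict(Xp) - y_ref)**2)

def eta_total(trees, oob_sets, i, rng):
    """eta_i^T = (1/M) * sum_m eps_m^i over the M trees of the forest."""
    return float(np.mean([eps_total(h, X, i, rng)
                          for h, X in zip(trees, oob_sets)]))
```

A model that ignores a variable yields $\varepsilon_m^i = 0$ for it, while for a variable the model depends on, permutation inflates the paired prediction error.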
Based on the procedure of MDA index, the single variable, group variables and correlated variables importance measures are expanded to establish the variable importance measure system.
Single Variable Importance Measure of Random Forest
For the decision tree $h_m$ ($m = 1, 2, \ldots, M$), the order of the OOB input data $\mathbf{X}_{OOB} = (\mathbf{X}_{OOB}^1, \ldots, \mathbf{X}_{OOB}^i, \ldots, \mathbf{X}_{OOB}^n)$ is randomly permuted except for $\mathbf{X}_{OOB}^i$; that is, the values of variable $X_i$ are fixed and the values of all the other variables are randomly permuted. The decision tree then predicts the modified OOB samples to obtain the predicted values $\mathbf{y}_m^{\sim i}$, with MSE $\varepsilon_m^{\sim i} = \mathrm{mean}(\mathbf{y}_m^{\sim i} - \mathbf{y}_m)^2$. Collecting the influence of variable $X_i$ over all decision trees, the average of $\varepsilon_m^{\sim i}$ is the main impact of variable $X_i$ based on the RF model:
$$\eta_i = \frac{1}{M}\sum_{m=1}^{M} \varepsilon_m^{\sim i}$$
The superscript $\sim i$ of $\varepsilon_m^{\sim i}$ and $\mathbf{y}_m^{\sim i}$ indicates that the OOB data are permuted, except for the $i$th column.
Group Variable Importance Measure of Random Forest
The MDA index of group variables is obtained as follows. In the process of permuting the OOB data, the values of variables $X_i$ and $X_j$ are fixed and the values of the other variables are permuted. The decision tree predicts the modified OOB samples to obtain the predicted values $\mathbf{y}_m^{\sim i,j}$, with MSE $\varepsilon_m^{\sim i,j} = \mathrm{mean}(\mathbf{y}_m^{\sim i,j} - \mathbf{y}_m)^2$. Collecting the influence of the group variables $[X_i, X_j]$ over all decision trees, the average of $\varepsilon_m^{\sim i,j}$ is the main impact of the group $[X_i, X_j]$ based on the RF model:
$$\eta_{ij} = \frac{1}{M}\sum_{m=1}^{M} \varepsilon_m^{\sim i,j}$$
The superscript $\sim i,j$ of $\varepsilon_m^{\sim i,j}$ and $\mathbf{y}_m^{\sim i,j}$ indicates that the OOB data are permuted, except for the $i$th and $j$th columns.
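A single helper covers both $\eta_i$ and $\eta_{ij}$: jointly permute every OOB column except the kept ones. As above, `predict` is a generic placeholder for a fitted tree:

```python
import numpy as np

def eps_fixed(predict, X_oob, keep, rng):
    """eps_m^{~keep}: jointly permute every OOB column EXCEPT those in
    `keep` (one index for eta_i, a pair {i, j} for eta_ij), so only the
    kept variables retain their association with the reference output."""
    y_ref = predict(X_oob)
    others = [j for j in range(X_oob.shape[1]) if j not in keep]
    Xp = X_oob.copy()
    p = rng.permutation(len(X_oob))     # one joint permutation of the rest
    Xp[:, others] = X_oob[np.ix_(p, others)]
    return np.mean((predict(Xp) - y_ref)**2)
```

Keeping all the variables a model actually uses leaves the predictions unchanged ($\varepsilon = 0$); keeping only one of them inflates the error by the contribution of the permuted ones.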
Correlated Variable Importance Measure of Random Forest
Over the past years, several RF-based techniques have been proposed to measure the importance of correlated variables [25,26]. However, these works directly apply importance measure techniques designed for independent variables to correlated ones, which is not reasonable. References [27,28] divided the variance-based sensitivity indices into a correlated contribution and an independent contribution. Moreover, sparse grid integration (SGI) has been carried out to perform importance analysis for correlated variables [29]. In this paper, the correlation among variables is accounted for in the RF importance measure itself. For a single decision tree of the RF model, the procedure for estimating the VIM consists of the following steps:
Step 1: Estimate the covariance matrix CX and mean vector μX from the original data X=(X1,…,Xi,…,Xn);
Step 2: Randomly extract the OOB data XOOB=(XOOB1,…,XOOBi,…,XOOBn) from the original data and use the other data to build the decision tree h_{m}(m=1,2,…,M). Use the decision tree h_{m} to predict the corresponding OOB data, and the prediction is ym;
Step 3: Split the matrix XOOB into two parts: vector XOOBi and matrix XOOB~i;
Step 4: Generate a new matrix $\mathbf{X}^{\sim i \mid i}$ and vector $\mathbf{X}^{i \mid \sim i}$ based on $\mathbf{X}_{OOB}^i$ and $\mathbf{X}_{OOB}^{\sim i}$, respectively. Their mean vectors and covariance matrices differ from the original $\boldsymbol{\mu}_X$ and $\mathbf{C}_X$, so the conditional ones must be used in the transformation. For the multivariate normal distribution, $\boldsymbol{\mu}_{\sim i \mid i}$, $\boldsymbol{\mu}_{i \mid \sim i}$, $\mathbf{C}_{\sim i \mid i}$ and $\mathbf{C}_{i \mid \sim i}$ are acquired as follows.
The mean vector $\boldsymbol{\mu}_X$ and covariance matrix $\mathbf{C}_X$ of $\mathbf{X}$ can be partitioned as $\boldsymbol{\mu}_X = [\boldsymbol{\mu}_{\sim i}, \mu_i]$ and $\mathbf{C}_X = \begin{bmatrix} \mathbf{C}_{\sim i} & \mathbf{C}_{\sim i,i} \\ \mathbf{C}_{i,\sim i} & C_i \end{bmatrix}$. The conditional mean vectors and covariance matrices are then given by [30]:
$$\boldsymbol{\mu}_{\sim i \mid i} = \boldsymbol{\mu}_{\sim i} + \mathbf{C}_{\sim i,i} C_i^{-1} (X_i - \mu_i), \quad \boldsymbol{\mu}_{i \mid \sim i} = \mu_i + \mathbf{C}_{i,\sim i} \mathbf{C}_{\sim i}^{-1} (\mathbf{X}_{\sim i} - \boldsymbol{\mu}_{\sim i})$$
$$\mathbf{C}_{\sim i \mid i} = \mathbf{C}_{\sim i} - \mathbf{C}_{\sim i,i} C_i^{-1} \mathbf{C}_{i,\sim i}, \quad \mathbf{C}_{i \mid \sim i} = C_i - \mathbf{C}_{i,\sim i} \mathbf{C}_{\sim i}^{-1} \mathbf{C}_{\sim i,i}$$
After obtaining the corresponding $\boldsymbol{\mu}_{\sim i \mid i}$, $\boldsymbol{\mu}_{i \mid \sim i}$, $\mathbf{C}_{\sim i \mid i}$ and $\mathbf{C}_{i \mid \sim i}$, the Nataf transform can be employed to draw the correlated normal samples $\mathbf{X}^{\sim i \mid i}$ and $\mathbf{X}^{i \mid \sim i}$ directly.
Step 5: Combine the matrix $\mathbf{X}^{\sim i \mid i}$ with the vector $\mathbf{X}_{OOB}^i$ into the new matrix $\mathbf{X}_{OOB,\mathrm{new}}^{i} = (\mathbf{X}^{\sim i \mid i,1}, \ldots, \mathbf{X}^{\sim i \mid i,i-1}, \mathbf{X}_{OOB}^i, \mathbf{X}^{\sim i \mid i,i+1}, \ldots, \mathbf{X}^{\sim i \mid i,n})$, and combine the vector $\mathbf{X}^{i \mid \sim i}$ with the matrix $\mathbf{X}_{OOB}^{\sim i}$ into $\mathbf{X}_{OOB,\mathrm{new}}^{\sim i} = (\mathbf{X}_{OOB}^1, \ldots, \mathbf{X}_{OOB}^{i-1}, \mathbf{X}^{i \mid \sim i}, \mathbf{X}_{OOB}^{i+1}, \ldots, \mathbf{X}_{OOB}^n)$;
Step 6: $\mathbf{X}_{OOB,\mathrm{new}}^{i}$ and $\mathbf{X}_{OOB,\mathrm{new}}^{\sim i}$ are passed down the decision tree and the predicted values $\mathbf{y}_m^i$ and $\mathbf{y}_m^{\sim i}$ are computed, respectively. The $\varepsilon_m^i$ and $\varepsilon_m^{\sim i}$ of the correlated variables are then calculated as:
$$\varepsilon_m^{\sim i} = \mathrm{mean}(\mathbf{y}_m^{\sim i} - \mathbf{y}_m)^2, \quad \varepsilon_m^i = \mathrm{mean}(\mathbf{y}_m^i - \mathbf{y}_m)^2$$
Collecting the influence of variable $X_i$ over all decision trees, the averages of $\varepsilon_m^{\sim i}$ and $\varepsilon_m^i$ ($m = 1, 2, \ldots, M$) give the main and total impact of variable $X_i$ on the RF model.
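Step 4 for Gaussian inputs can be sketched as a conditional sampler built directly from the partitioned mean/covariance formulas above; a plain Cholesky draw is used here in place of the Nataf transform (a simplifying assumption valid for jointly normal inputs):

```python
import numpy as np

def conditional_sampler(mu, C, i, rng):
    """Sampler for X_~i | X_i = x_i from the Step 4 conditional formulas."""
    mu = np.asarray(mu, float)
    C = np.asarray(C, float)
    rest = [j for j in range(len(mu)) if j != i]
    C_ri = C[rest, i]                                # C_{~i,i}
    C_rr = C[np.ix_(rest, rest)]                     # C_{~i}
    C_cond = C_rr - np.outer(C_ri, C_ri) / C[i, i]   # C_{~i|i}
    L = np.linalg.cholesky(C_cond + 1e-12 * np.eye(len(rest)))
    def sample(x_i):
        mu_cond = mu[rest] + C_ri / C[i, i] * (x_i - mu[i])  # mu_{~i|i}
        return mu_cond + L @ rng.standard_normal(len(rest))
    return sample
```

For a bivariate standard normal with correlation 0.8 and $x_i = 1$, the conditional draws should have mean $0.8$ and variance $1 - 0.8^2 = 0.36$, matching $\boldsymbol{\mu}_{\sim i \mid i}$ and $\mathbf{C}_{\sim i \mid i}$.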
The importance measure indices in correlated space and independent space are all given based on RF, which will establish the complete VIM system.
Link between VIM of RF and Variance-Based Global Sensitivity
The similarity analysis process of MDA index εmi based on OOB data and single-loop Monte Carlo simulation of variance-based global sensitivity can be used as a breakthrough point to find their link. The relationship between MDA index and variance-based global sensitivity can be explored firstly.
When the sample size is large, $\frac{1}{N}\sum_{j=1}^{N} (y_{m,j}^i)^2$ asymptotically equals $\frac{1}{N}\sum_{j=1}^{N} (y_{m,j})^2$; both are second-order moment estimators of the output response $Y$.
1) The total sensitivity index of the single-loop Monte Carlo numerical simulation is equivalent to:

$$S_i^T = \frac{\varepsilon_m^i}{2\,\mathrm{Var}(Y)} \tag{11}$$
Thus the relationship between the MDA index of the RF importance measure and the variance-based total sensitivity index is established: $\varepsilon_m^i$ indicates the total impact of variable $X_i$ on the output. The larger $\varepsilon_m^i$ is, the larger $S_i^T$ is, i.e., the larger the total contribution of the variable to the output.
2) The main variance-based sensitivity index $S_i$ of the single-loop Monte Carlo numerical simulation is equivalent to:

$$S_i = 1 - \frac{\varepsilon_m^{\sim i}}{2\,\mathrm{Var}(Y)} \tag{13}$$
Eq. (13) shows the relationship between $\varepsilon_m^{\sim i}$ and the main variance-based sensitivity index $S_i$: $\varepsilon_m^{\sim i}$ indicates the main impact of variable $X_i$ on the output. The larger $\varepsilon_m^{\sim i}$ is, the smaller $S_i$ is, i.e., the smaller the main contribution of the variable to the output.
3) The relationship between the variance-based sensitivity index of group variables $S_{[i,j]}$ and $\varepsilon_m^{\sim i,j}$ can be expressed as:
$$S_{[i,j]} = 1 - \frac{\varepsilon_m^{\sim i,j}}{2\,\mathrm{Var}(Y)}$$
The influence of the group variables $[X_i, X_j]$ on the output variance, $S_{[i,j]}$, is composed of the main sensitivity indices $S_i$, $S_j$ and the second order sensitivity index $S_{ij}$:
$$S_{[i,j]} = S_i + S_j + S_{ij}$$
Combining Eqs. (13)–(15), the second-order variance sensitivity index can be derived:
$$S_{ij} = \frac{\varepsilon_m^{\sim i} + \varepsilon_m^{\sim j} - \varepsilon_m^{\sim i,j}}{2\,\mathrm{Var}(Y)} - 1$$
So far, the MDA index, single variable index and group variables index are all proposed in the independent variable space.
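Eqs. (11), (13) and (16) are simple arithmetic on the permutation errors. As a worked check, for $Y = X_1 + 2X_2$ with independent unit-variance inputs ($\mathrm{Var}(Y) = 5$, $S_1 = 0.2$, $S_2 = 0.8$, $S_{12} = 0$) the exact permutation errors are $\varepsilon^1 = 2$, $\varepsilon^{\sim 1} = 8$, $\varepsilon^{\sim 2} = 2$ and $\varepsilon^{\sim 1,2} = 0$:

```python
def S_total(eps_i, var_y):
    """Eq. (11): S_i^T = eps^i / (2 Var(Y))."""
    return eps_i / (2 * var_y)

def S_main(eps_not_i, var_y):
    """Eq. (13): S_i = 1 - eps^{~i} / (2 Var(Y))."""
    return 1 - eps_not_i / (2 * var_y)

def S_pair(eps_not_i, eps_not_j, eps_not_ij, var_y):
    """Eq. (16): S_ij = (eps^{~i} + eps^{~j} - eps^{~i,j}) / (2 Var(Y)) - 1."""
    return (eps_not_i + eps_not_j - eps_not_ij) / (2 * var_y) - 1
```

Plugging in the worked numbers recovers $S_1^T = 0.2$, $S_1 = 0.2$ and $S_{12} = 0$, consistent with the additive structure of the example.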
4) In the correlated variable space, $\mathrm{Var}(Y) \ne \mathrm{Var}(\mathbf{y}_m^{\sim i}) \ne \mathrm{Var}(\mathbf{y}_m^i)$, so Eqs. (11) and (13) change into the following formulas:
$$S_i = 1 - \frac{\varepsilon_m^{\sim i} - [E(\mathbf{y}_m^{\sim i})]^2 + [E(\mathbf{y}_m)]^2}{2\,\mathrm{Var}(Y)}$$
$$S_i^T = \frac{\varepsilon_m^i - [E(\mathbf{y}_m^i)]^2 + [E(\mathbf{y}_m)]^2}{2\,\mathrm{Var}(Y)}$$
$S_i$ contains the independent contribution of variable $X_i$ and the correlated contribution induced by the Pearson correlation coefficients, while $S_i^T$ consists of the independent contribution of the variable itself and its interaction contribution with the other variables.
Examples and Discussion

Numerical Example 1: Ishigami Function
Ishigami function is considered:
$$Y = \sin(X_1) + 7\sin^2(X_2) + 0.1 X_3^4 \sin(X_1)$$
where the $X_i$ are independent and uniformly distributed on the interval $[-\pi, \pi]$. The Ishigami function is highly nonlinear. For variable $X_2$, the convergence trends of the importance measures with the number of sample points by Monte Carlo simulation and RF are shown in Fig. 4. There are 500 decision trees in the RF model. Tabs. 1 and 2 show the VIM results of the single variables and group variables respectively, together with the analytical results ($S_i$(Ana), $S_i^T$(Ana) and $S_{ij}$(Ana)) for comparison.
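To isolate the permutation scheme from the RF surrogate, the indices of $X_2$ can be estimated by applying Eqs. (11) and (13) directly to the true Ishigami function ($X_2$ has no interactions, so its main and total indices coincide at the analytical value 0.442):

```python
import numpy as np

a, b = 7.0, 0.1
def g(X):
    """Ishigami function."""
    return (np.sin(X[:, 0]) + a * np.sin(X[:, 1])**2
            + b * X[:, 2]**4 * np.sin(X[:, 0]))

rng = np.random.default_rng(0)
N = 1 << 16
X = rng.uniform(-np.pi, np.pi, (N, 3))
y = g(X)
V = y.var()

def eps_perm(i=None, keep=None):
    """eps^i (permute column i) or eps^{~keep} (permute all other columns)."""
    Xp = X.copy()
    if i is not None:
        Xp[:, i] = rng.permutation(Xp[:, i])
    else:
        others = [j for j in range(3) if j != keep]
        p = rng.permutation(N)
        Xp[:, others] = X[np.ix_(p, others)]
    return np.mean((g(Xp) - y)**2)

S2 = 1 - eps_perm(keep=1) / (2 * V)    # Eq. (13); analytical value 0.442
S2T = eps_perm(i=1) / (2 * V)          # Eq. (11); analytical value 0.442
```

With $2^{16}$ samples both estimates land within a few thousandths of the analytical value, mirroring the Tab. 1 results without any surrogate model.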
The convergence trends of the importance measures with sample size (a) The convergence trend of MC simulation (b) The convergence trend of the RF model
The single variable VIMs of the Ishigami function

| Variable | η_i | η_i⇒S_i | S_i(Ana) | Error (%) | η_i^T | η_i^T⇒S_i^T | S_i^T(Ana) | Error (%) |
|---|---|---|---|---|---|---|---|---|
| X_1 | 18.997 | 0.314 | 0.314 | – | 15.359 | 0.555 | 0.558 | 0.54 |
| X_2 | 15.316 | 0.447 | 0.442 | 1.13 | 12.331 | 0.445 | 0.442 | 0.68 |
| X_3 | 27.784 | 0.003 | 0.000 | – | 6.690 | 0.242 | 0.244 | 0.82 |
The group variables VIMs of the Ishigami function

| Variables | η_ij | η_ij⇒S_ij | S_ij(Ana) | Error (%) |
|---|---|---|---|---|
| X_1X_2 | 6.698 | 0.003 | 0.000 | – |
| X_1X_3 | 12.413 | 0.241 | 0.244 | 1.23 |
| X_2X_3 | 15.364 | 0.002 | 0.000 | – |
In all VIM result tables, η_i^T⇒S_i^T, η_i⇒S_i and η_ij⇒S_ij mean that the importance measures in that column are derived from Eqs. (11), (13) and (16), respectively.
The single-loop Monte Carlo simulation requires $5 \times 10^{20}$ random samples to achieve the required accuracy, while the RF model only needs $10^3$ samples (see Fig. 4); this comparison shows that the RF method converges faster. The MDA indices of RF yield variance-based sensitivity indices consistent with the analytical solutions (see Tabs. 1 and 2), which shows that the RF model provides high accuracy. For the Ishigami function, the third-order sensitivity index $S_{123} = 0$, so the relationship among the variance-based sensitivity indices is $S_i^T = S_i + \sum_{j \ne i} S_{ij}$, which agrees well with the VIM estimators.
Numerical Example 2: Linear Function with Correlated Variables
A linear model is considered [28]:
$$Y = X_1 + X_2 + X_3$$
where the $X_i$ are jointly normally distributed with $\boldsymbol{\mu}_X = [0, 0, 0]$ and covariance matrix $\mathbf{C}_X = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & \rho\sigma \\ 0 & \rho\sigma & \sigma^2 \end{bmatrix}$. The analytical main and total sensitivity indices can be calculated as:

$$S_1 = S_1^T = \frac{1}{V}, \quad S_2 = \frac{(1+\rho\sigma)^2}{V}, \quad S_3 = \frac{(\sigma+\rho)^2}{V}, \quad S_2^T = \frac{1-\rho^2}{V}, \quad S_3^T = \frac{(1-\rho^2)\sigma^2}{V}$$

where $V = \mathrm{Var}(Y) = 2 + \sigma^2 + 2\rho\sigma$.
There are 500 decision trees and 600 samples used to analyze the importance measures. Fig. 5 shows the importance measures of the correlated input variables for different values of ρ. Tab. 3 shows the importance measures of the independent and correlated variable cases at σ = 2, together with the analytical solutions for comparison.
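The analytical column of Tab. 3 can be reproduced from closed forms of the conditional expectations of the jointly normal inputs; the formulas below are the ones stated above and match every tabulated analytical value:

```python
import numpy as np

def linear_model_indices(rho, sigma):
    """Closed-form main/total indices for Y = X1 + X2 + X3 with
    Var(X1) = Var(X2) = 1, Var(X3) = sigma^2, corr(X2, X3) = rho
    (X1 independent of the others)."""
    V = 2 + sigma**2 + 2 * rho * sigma               # Var(Y)
    S = np.array([1.0, (1 + rho * sigma)**2, (sigma + rho)**2]) / V
    ST = np.array([1.0, 1 - rho**2, (1 - rho**2) * sigma**2]) / V
    return S, ST
```

For example, `linear_model_indices(0.5, 2.0)` reproduces the ρ = 0.5 analytical row of Tab. 3: S = (0.125, 0.500, 0.781) and S^T = (0.125, 0.094, 0.375); for ρ = 0 the main and total indices coincide.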
The importance measures of correlated input variables at different correlation coefficients (a) Importance measures vs. correlation coefficients (b) $S_i - S_i^T$ vs. correlation coefficients

The single variable VIMs of Example 5.2
| ρ | Variable | η_i | η_i⇒S_i | S_i(Ana) | Error (%) | η_i^T | η_i^T⇒S_i^T | S_i^T(Ana) | Error (%) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | X_1 | 9.909 | 0.163 | 0.167 | 2.39 | 1.957 | 0.166 | 0.167 | 0.60 |
| | X_2 | 9.921 | 0.162 | 0.167 | 2.99 | 1.975 | 0.168 | 0.167 | 0.60 |
| | X_3 | 3.930 | 0.667 | 0.667 | – | 7.918 | 0.669 | 0.667 | 0.30 |
| 0.5 | X_1 | 14.031 | 0.124 | 0.125 | 0.80 | 1.685 | 0.123 | 0.125 | 1.60 |
| | X_2 | 8.964 | 0.498 | 0.500 | 0.40 | 1.742 | 0.094 | 0.094 | – |
| | X_3 | 3.423 | 0.781 | 0.781 | – | 7.277 | 0.370 | 0.375 | 1.33 |
| −0.5 | X_1 | 6.440 | 0.244 | 0.250 | 2.40 | 1.707 | 0.252 | 0.250 | 0.80 |
| | X_2 | 8.927 | 0.001 | 0.000 | – | 1.745 | 0.190 | 0.188 | 1.06 |
| | X_3 | 3.444 | 0.555 | 0.563 | 1.42 | 7.248 | 0.754 | 0.750 | 0.53 |
| 0.8 | X_1 | 16.527 | 0.102 | 0.109 | 6.42 | 1.292 | 0.106 | 0.109 | 2.75 |
| | X_2 | 7.330 | 0.739 | 0.735 | 0.54 | 1.344 | 0.039 | 0.039 | – |
| | X_3 | 2.624 | 0.856 | 0.852 | 0.47 | 6.008 | 0.150 | 0.157 | 4.46 |
| −0.8 | X_1 | 4.765 | 0.356 | 0.357 | 0.28 | 1.298 | 0.360 | 0.357 | 0.84 |
| | X_2 | 7.389 | 0.129 | 0.129 | – | 1.355 | 0.126 | 0.129 | 2.33 |
| | X_3 | 2.659 | 0.511 | 0.514 | 0.58 | 6.012 | 0.504 | 0.514 | 1.95 |
All the importance measures for both correlated and independent variables are simulated. From the analytical main and total sensitivity indices, it can be found that $S_i^T \le S_i$ if $\rho \ge 0$ or $\rho \le -\frac{2\sigma}{\sigma^2+1}$. The interaction sensitivity indices are all zero, so $S_i - S_i^T$ contains only the correlated contribution induced by the Pearson correlation coefficient. For variable $X_1$, the main sensitivity index $S_1$ equals the total index $S_1^T$ and $S_1 - S_1^T = 0$, because $X_1$ is independent of the other variables. For variables $X_2$ and $X_3$, $S_2 - S_2^T = S_3 - S_3^T$, which confirms that the correlated contribution is generated by the Pearson correlation coefficient.
Numerical Example 3: Nonlinear Function with Correlated Variables
Consider a nonlinear model $Y = X_1 X_3 + X_2 X_4$ [28], where $\mathbf{X} \sim N(\boldsymbol{\mu}_X, \mathbf{C}_X)$ with $\boldsymbol{\mu}_X = [0, 0, \mu_3, \mu_4]$ and covariance matrix

$$\mathbf{C}_X = \begin{bmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & 0 & 0 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \rho_{34}\sigma_3\sigma_4 \\ 0 & 0 & \rho_{34}\sigma_3\sigma_4 & \sigma_4^2 \end{bmatrix}$$
The analytical values of the main and total sensitivity indices can be derived in closed form, with total variance

$$V = \sigma_1^2(\sigma_3^2 + \mu_3^2) + \sigma_2^2(\sigma_4^2 + \mu_4^2) + 2\rho_{12}\sigma_1\sigma_2(\rho_{34}\sigma_3\sigma_4 + \mu_3\mu_4)$$
Set $\boldsymbol{\mu}_X = [0, 0, 250, 400]$ and the standard deviation vector $\boldsymbol{\sigma} = [4, 2, 200, 300]$. There are 500 decision trees and 3000 samples used to construct the RF model. Tab. 4 shows the group-variable VIM results for the independent case. The Pearson correlation coefficients are $\rho_{12} = 0.3$ and $\rho_{34} = -0.3$. Tab. 5 shows the single-variable importance measures in both the correlated and the independent variable space.
The group variables VIMs of Example 5.3

| | X_1X_2 | X_1X_3 | X_1X_4 | X_2X_3 | X_2X_4 | X_3X_4 |
|---|---|---|---|---|---|---|
| η_ij | 1.931×10^6 | 1.975×10^6 | 3.206×10^6 | 3.905×10^6 | 3.207×10^6 | 5.171×10^6 |
| η_ij⇒S_ij | 0.000 | 0.242 | 0.002 | 0.004 | 0.137 | 0.008 |
The single variable VIMs of Example 5.3

| Case | Variable | η_i | η_i⇒S_i | S_i(Ana) | Error (%) | η_i^T | η_i^T⇒S_i^T | S_i^T(Ana) | Error (%) |
|---|---|---|---|---|---|---|---|---|---|
| Independent | X_1 | 3.205×10^6 | 0.380 | 0.379 | 0.26 | 3.223×10^6 | 0.623 | 0.621 | 0.32 |
| | X_2 | 3.903×10^6 | 0.246 | 0.242 | 1.65 | 1.977×10^6 | 0.382 | 0.379 | 0.79 |
| | X_3 | 5.199×10^6 | 0.004 | 0.000 | – | 1.225×10^6 | 0.237 | 0.242 | 2.07 |
| | X_4 | 5.188×10^6 | 0.002 | 0.000 | – | 7.063×10^5 | 0.137 | 0.136 | 0.74 |
| Correlated | X_1 | 5.356×10^6 | 0.492 | 0.507 | 2.96 | 1.835×10^6 | 0.490 | 0.492 | 0.41 |
| | X_2 | 2.473×10^6 | 0.403 | 0.399 | 1.00 | 4.319×10^6 | 0.333 | 0.300 | 11.0 |
| | X_3 | 6.036×10^6 | 0.001 | 0.000 | – | 1.089×10^6 | 0.189 | 0.192 | 1.56 |
| | X_4 | 5.924×10^6 | 0.000 | 0.000 | – | 6.938×10^5 | 0.108 | 0.108 | – |
Tabs. 4 and 5 show that the analytical values and the numerical VIM estimates agree well. In the independent variable space, the third and fourth order sensitivity indices are all zero, so the relationship between the single-variable and group-variable importance measures is again $S_i^T = S_i + \sum_{j \ne i} S_{ij}$.
Engineering Example 4: Series and Parallel Electronic Models
The reliability of electronic instruments at the design stage has attracted much attention. Two simple electronic circuit models from reference [31] are used to compute the VIMs. Both series and parallel structures (shown in Fig. 6) are considered. Each electronic circuit model contains four elements, whose lifetimes $T_i$ independently obey exponential distributions with failure rate parameters $\boldsymbol{\lambda} = [1, 1/4.5, 1/9, 1/99]$. The lifetime $T$ of each model can be expressed as:
Series model: $T = \min(T_1, T_2, T_3, T_4)$

Parallel model: $T = \max(T_1, T_2, T_3, T_4)$
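The permutation-based main index can be run directly on these lifetime models; this sketch applies Eq. (13) to the true min/max functions rather than to an RF surrogate:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = np.array([1.0, 1 / 4.5, 1 / 9.0, 1 / 99.0])   # failure rates
N = 1 << 16
T = rng.exponential(1.0 / lam, size=(N, 4))          # component lifetimes

def main_index(model, i):
    """S_i = 1 - eps^{~i} / (2 Var(Y)), permuting all columns except i."""
    y = model(T)
    V = y.var()
    Tp = T.copy()
    others = [j for j in range(4) if j != i]
    p = rng.permutation(N)
    Tp[:, others] = T[np.ix_(p, others)]
    return 1 - np.mean((model(Tp) - y)**2) / (2 * V)

S1_series = main_index(lambda t: t.min(axis=1), 0)    # T_1 dominates the series model
S4_parallel = main_index(lambda t: t.max(axis=1), 3)  # T_4 dominates the parallel model
```

The estimates reproduce the qualitative picture of Tab. 6: the shortest-lived component $T_1$ dominates the series lifetime, while the longest-lived component $T_4$ almost entirely determines the parallel lifetime.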
The series and parallel electronic circuit structures (a) Series model (b) Parallel model
Tabs. 6 and 7 show the importance measures computed by the RF model with 500 decision trees and 15000 samples. Because the electronic circuit models are discontinuous, more samples are needed to acquire a precise surrogate model and accurate importance measures. Additionally, MC simulation results with $6 \times 2^{25}$ random samples are presented as approximate exact solutions $S_i$(MC), $S_i^T$(MC) and $S_{ij}$(MC) for comparison. The comparison shows that the RF importance measures are also appropriate for discontinuous models. The main sensitivity indices are almost equal to the total indices in the parallel model, while they differ significantly in the series model (see Tab. 6). The second-order indices of the series model are not zero (see Tab. 7), which causes the VIM difference between the parallel and series models.
The single variable VIMs of electronic models

| Model | Variable | η_i | η_i⇒S_i | S_i(MC) | η_i^T | η_i^T⇒S_i^T | S_i^T(MC) |
|---|---|---|---|---|---|---|---|
| Series | T_1 | 0.429 | 0.607 | 0.593 | 0.942 | 0.864 | 0.853 |
| | T_2 | 0.993 | 0.090 | 0.090 | 0.308 | 0.282 | 0.284 |
| | T_3 | 1.048 | 0.039 | 0.043 | 0.158 | 0.145 | 0.153 |
| | T_4 | 1.090 | 0.001 | 0.004 | 0.005 | 0.004 | 0.0149 |
| Parallel | T_1 | 1.929×10^4 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| | T_2 | 1.929×10^4 | 0.000 | 0.000 | 0.000 | 0.000 | 0.001 |
| | T_3 | 1.929×10^4 | 0.000 | 0.000 | 1.929×10^4 | 0.001 | 0.001 |
| | T_4 | 12.232 | 0.999 | 0.999 | 12.217 | 1.000 | 1.000 |
The group variables VIMs of series model

| | T_1T_2 | T_1T_3 | T_1T_4 | T_2T_3 | T_2T_4 | T_3T_4 |
|---|---|---|---|---|---|---|
| η_ij | 0.835 | 0.705 | 0.602 | 0.142 | 0.095 | 0.047 |
| η_ij⇒S_ij | 0.152 | 0.069 | 0.006 | 0.008 | 0.001 | 0.000 |
| S_ij(MC) | 0.156 | 0.069 | 0.003 | 0.006 | 0.003 | 0.000 |
Engineering Example 5: A Cantilever Tube Model
A cantilever tube model (shown in Fig. 7) is used to analyze the variable importance measures. The model is a nonlinear model with six random variables. The input variables are outer diameter d, thickness t, external forces F_{1}, F_{2}, P and torsion T, respectively.
The cantilever tube model
The tensile stress σx and the torsion stress τzx can be analyzed:
$$\sigma_x = \frac{P + F_1\sin\theta_1 + F_2\sin\theta_2}{A} + \frac{Md}{2I}, \quad \tau_{zx} = \frac{Td}{4I}$$
where the sectional area A, the bending moment M and the inertia moment I can be calculated by the following formula:
$$A = \frac{\pi}{4}\left[d^2 - (d-2t)^2\right], \quad M = F_1 L_1 \cos\theta_1 + F_2 L_2 \cos\theta_2, \quad I = \frac{\pi}{64}\left[d^4 - (d-2t)^4\right]$$
The maximum stress of the cantilever is $\sigma_{\max} = \sqrt{\sigma_x^2 + 3\tau_{zx}^2}$. All input variables $t$, $d$, $F_1$, $F_2$, $P$ and $T$ are normally distributed with the parameters shown in Tab. 8. The Pearson correlation coefficients are $\rho_{td} = 0.3$ and $\rho_{F_1 F_2} = 0.5$. There are 500 decision trees and 7000 samples in the RF model. Tab. 9 gives the variable importance measures obtained by the RF method and by single-loop Monte Carlo simulation; the cost of the MC method is $8 \times 2^{23}$ points for each case.
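The stress model can be written as a small function. Note that the load angles θ_1, θ_2 and moment arms L_1, L_2 are not specified in the text, so the defaults below are purely illustrative assumptions, and the bending term follows the reconstructed form Md/(2I):

```python
import numpy as np

def sigma_max(t, d, F1, F2, P, T,
              theta1=np.radians(5.0), theta2=np.radians(10.0),
              L1=60.0, L2=120.0):
    """Maximum stress of the cantilever tube. theta1, theta2, L1 and L2
    are NOT given in the text; these defaults are illustrative assumptions
    only (angles in radians, lengths in mm)."""
    A = np.pi / 4 * (d**2 - (d - 2 * t)**2)                   # sectional area
    I = np.pi / 64 * (d**4 - (d - 2 * t)**4)                  # inertia moment
    M = F1 * L1 * np.cos(theta1) + F2 * L2 * np.cos(theta2)   # bending moment
    sx = (P + F1 * np.sin(theta1) + F2 * np.sin(theta2)) / A + M * d / (2 * I)
    tzx = T * d / (4 * I)                                     # torsion stress
    return np.sqrt(sx**2 + 3 * tzx**2)
```

Evaluating at the mean values of Tab. 8 gives the nominal stress; since every load term enters $\sigma_x$ positively, the output grows monotonically with the axial force P.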
Distribution parameters of input variables

| Variable/unit | Mean | Standard deviation |
|---|---|---|
| t/mm | 5 | 0.1 |
| d/mm | 42 | 0.5 |
| F_1/N | 3000 | 300 |
| F_2/N | 3000 | 300 |
| P/N | 12000 | 1200 |
| T/N·mm | 90000 | 9000 |
For the independent variables, the main and total sensitivity indices of each input are very close (see Tab. 9), which suggests that the influence of these variables on the output response comes mainly from the variables themselves and that the interaction contributions are very small. The external force P is the most important variable in the independent space; the importance of the other input variables differs only slightly.
The VIMs of cantilever tube model

| Space | Index | t | d | F_1 | F_2 | P | T |
|---|---|---|---|---|---|---|---|
| Independent | η_i | 9.690 | 9.216 | 9.407 | 9.937 | 4.060 | 9.416 |
| | η_i⇒S_i | 0.061 | 0.107 | 0.089 | 0.037 | 0.607 | 0.088 |
| | S_i(MC) | 0.060 | 0.112 | 0.086 | 0.038 | 0.615 | 0.088 |
| | η_i^T | 0.706 | 1.172 | 0.906 | 0.407 | 6.328 | 0.934 |
| | η_i^T⇒S_i^T | 0.068 | 0.114 | 0.088 | 0.039 | 0.613 | 0.091 |
| | S_i^T(MC) | 0.060 | 0.112 | 0.086 | 0.038 | 0.615 | 0.089 |
| Correlated | η_i | 10.842 | 9.863 | 9.730 | 9.970 | 4.641 | 10.335 |
| | η_i⇒S_i | 0.054 | 0.140 | 0.165 | 0.107 | 0.590 | 0.090 |
| | S_i(MC) | 0.057 | 0.133 | 0.151 | 0.110 | 0.593 | 0.085 |
| | η_i^T | 0.174 | 1.180 | 0.593 | 0.473 | 6.747 | 0.973 |
| | η_i^T⇒S_i^T | 0.008 | 0.094 | 0.064 | 0.021 | 0.592 | 0.086 |
| | S_i^T(MC) | 0.013 | 0.089 | 0.065 | 0.024 | 0.593 | 0.086 |
Furthermore, the importance measures differ in the correlated variable space. For the correlated input variables $t$, $d$, $F_1$ and $F_2$, the sensitivity indices satisfy $S_i > S_i^T$: their influence on the output response mainly originates from the correlated contribution induced by the Pearson correlation coefficients. The input variables P and T are independent of the other variables, so their main indices are almost equal to their total sensitivity indices. Therefore, the proposed RF variable importance measure system not only identifies the important variables but also provides useful information about the structure of the engineering model, offering useful guidance for engineering design and optimization.
Engineering Example 6: Solar Wing Mast of Space Station
The solar wing mast of space station is a truss structure in 3D space based on triangular structure, shown in Fig. 8.
Solar wing mast structure [<xref ref-type="bibr" rid="ref-32">32</xref>]
The solar wing mast is made of titanium alloy. The material properties (density ρ, elastic modulus E and Poisson’s ratio ν), the external loads (dynamic load F_{1} and static load F_{2}) and the sectional area of the truss A are random variables; the corresponding distribution parameters are listed in Tab. 10.
Distribution parameters of input variables

Variable/unit | Mean | Standard deviation
ρ/kg⋅m^{−3} | 4300 | 215
E/GPa | 106 | 5.3
ν | 0.3 | 0.015
A/m^{2} | 0.0001 | 5×10^{−6}
F_{1}/N | 100 | 5
F_{2}/N | 100 | 10
Software CATIA is used to establish the geometry and the finite element model. Taking the maximum stress as the output response, ABAQUS is called repeatedly to analyze the finite element model, and 210 samples are obtained. Random forest is then used to analyze the variable importance measures; the results are listed in Tab. 11.
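As a rough stand-in for this fit-then-rank step (using a standard scikit-learn random forest with permutation importance, not the advanced Kriging-leaf RF of this paper), the workflow might look as follows; the 210-sample size is kept, but the response function and its dominant input are synthetic assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(210, 6))   # 210 samples, 6 inputs (as in Tab. 10)
# Synthetic stand-in for the FE maximum-stress response: input 3 dominates
y = X[:, 3] ** 2 + 0.5 * X[:, 0] + 0.1 * rng.normal(size=210)

rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X, y)

# MDA-style importance: accuracy loss when each input column is permuted
imp = permutation_importance(rf, X, y, n_repeats=20, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]  # most important first
```

In the real workflow the 210 (X, y) pairs would come from the CATIA/ABAQUS runs, and the MDA indices would be converted to sensitivity indices as in Tab. 11.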
The VIMs of solar wing mast

Variable | ηi | ηi⇒Si | ηiT | ηiT⇒SiT
ρ | 3.144×10^{12} | 0.0106 | 2.434×10^{12} | 0.7586
E | 3.133×10^{12} | 0.0138 | 2.454×10^{12} | 0.7647
ν | 3.179×10^{12} | 0.0000 | 2.692×10^{11} | 0.0860
A | 2.754×10^{12} | 0.1379 | 1.096×10^{12} | 0.3576
F_{1} | 3.161×10^{12} | 0.0060 | 3.225×10^{11} | 0.0994
F_{2} | 3.089×10^{12} | 0.0309 | 3.857×10^{11} | 0.1301
According to the variable importance results, the main sensitivity index of Poisson’s ratio ν is almost zero and its total sensitivity index is also the smallest; to simplify the model, ν can therefore be treated as a constant. The sectional area of the truss A is the key design variable, since A has the largest main sensitivity index. There is a large interaction between the density ρ and the elastic modulus E, and the interaction sensitivity index can be obtained indirectly as SρE ≈ 0.4623. The external loads F_{1} and F_{2} can be regarded as secondary variables. The variable importance measures thus give designers reasonable suggestions for allocating the optimization space of the design variables more effectively and for reducing the optimization dimension.
Conclusions
The Kriging regression model is used as the leaf-node model of the decision trees to improve the prediction accuracy of RF. Single-variable, group-variable and correlated-variable importance measures based on RF are presented, which constitute a complete RF variable importance measure system. Additionally, a novel approach for solving variance-based global sensitivity indices is presented, and the new interpretation of these VIM indices is introduced. The results of the numerical and engineering examples confirm that the RF VIM indices can further yield the variance-based sensitivity indices with higher computational efficiency than single-loop MC simulation.
For some cases of incomplete probability information, such as linearly correlated non-normal variables, non-linearly correlated variables and discrete input-output samples, the proposed importance measure analysis method has limited applicability. In future work, importance measures under incomplete probability information will be studied based on equivalent transformations or Copula functions.
Nomenclature

VIM: Variable Importance Measure
RF: Random Forest
DT: Decision Tree
MDI: Mean Decrease Impurity
MDA: Mean Decrease Accuracy
OOB: Out-of-Bag
SA: Sensitivity Analysis
MC: Monte Carlo
SDP: State-Dependent Parameter
HDMR: High Dimensional Model Representation
SGI: Sparse Grid Integration
ANOVA: Analysis of Variance
MSE: Mean Square Error
X, Y: the input variable vector and the output response
g( ): the response function
n: the dimension of the input variables
g_{0}: the expectation of the response function
f_{X}(x): the probability density function of variable X
E( ), Var( ): the expectation and variance operators
X~i: the variable vector without X_{i}
μ~i: the mean vector without μ_{i}
V, σ, ρ: the variance, standard deviation and Pearson correlation coefficient of a variable
μX, CX: the mean vector and covariance matrix of the normal input variables
μ~i∣i, C~i∣i: the conditional mean vector and conditional covariance matrix of the dependent normal variables
μi∣~i, Ci∣~i: the conditional mean and conditional covariance of a dependent normal variable
the response vectors of the corresponding sample matrices
Authors’ Contributions: Conceptualization and methodology by Song, S. F., validation and writing by He, R. Y., examples and computation by Shi, Z. Y., examples and writing by Zhang, W. Y.
Funding Statement: The authors received no specific funding for this study.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
References

1. Lu, Z. Z., Li, L. Y., Song, S. F., Hao, W. R. (2015).
2. Borgonovo, E. (2007). A new uncertainty importance measure.
3. Liu, Q., Homma, T. (2009). A new computational method of a moment-independent uncertainty importance measure.
4. Cui, L. J., Lu, Z. Z., Zhao, X. P. (2010). Moment-independent importance measure of basic random variable and its probability density evolution solution.
5. Saltelli, A., Annoni, P., Azzini, I. (2010). Variance based sensitivity analysis of model output: Design and estimator for the sensitivity indices.
6. Ziehn, T., Tomlin, A. S. (2008). A global sensitivity study of sulphur chemistry in a premixed methane flame model using HDMR.
7. Ratto, M., Pagano, A., Young, P. C. (2007). State dependent parameter meta-modeling and sensitivity analysis.
8. Breiman, L. (2001). Random forests.
9. Wang, J. H., Yan, W. Z., Wan, Z. J., Wang, Y., Lv, J. K. et al. (2020). Prediction of permeability using random forest and genetic algorithm model.
10. Yu, B., Chen, F., Chen, H. Y. (2019). NPP estimation using random forest and impact feature variable importance analysis.
11. Hallett, M. J., Fan, J. J., Su, X. G., Levine, R. A., Nunn, M. E. (2014). Random forest and variable importance rankings for correlated survival data, with applications to tooth loss.
12. Cutler, A., Cutler, D. R., Stevens, J. R. (2011). Random forests.
13. Loecher, M. (2020). From unbiased MDI feature importance to explainable AI for trees. https://www.researchgate.net/publication/340224035.
14. Mitchell, M. W. (2011). Bias of the random forest out-of-bag (OOB) error for certain input parameters.
15. Bénard, C., Veiga, S. D., Scornet, E. (2021). MDA for random forests: Inconsistency and a practical solution via the Sobol-MDA. http://www.researchgate.net/publication/349682846.
16. Zhang, X. M., Wada, T., Fujiwara, K., Kano, M. (2020). Regression and independence based variable importance measure.
17. Fisher, A., Rudin, C., Dominici, F. (2019). All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously.
18. Song, S. F., He, R. Y. (2021). Importance measure index system based on random forest.
19. Sobol, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates.
20. Saltelli, A., Tarantola, S. (2002). On the relative importance of input factors in mathematical models: Safety assessment for nuclear waste disposal.
21. Saltelli, A. (2002). Sensitivity analysis for importance assessment.
22. Abdulkareem, N. M., Abdulazeez, A. M. (2021). Machine learning classification based on random forest algorithm: A review.
23. Athey, S., Tibshirani, J., Wager, S. (2019). Generalized random forests.
24. Badih, G., Pierre, M., Laurent, B. (2019). Assessing variable importance in clustering: A new method based on unsupervised binary decision trees.
25. Behnamian, A., Banks, S., White, L., Millard, K., Pouliot, D. et al. (2019). Dimensionality reduction in the presence of highly correlated variables for random forests: Wetland case study. IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, pp. 9839–9842, Yokohama, Japan.
26. Gazzola, G., Jeong, M. K. (2019). Dependence-biased clustering for variable selection with random forests.
27. Mara, T. A., Tarantola, S. (2012). Variance-based sensitivity indices for models with dependent inputs.
28. Kucherenko, S., Tarantola, S., Annoni, P. (2012). Estimation of global sensitivity indices for models with dependent variables.
29. Li, L. Y., Lu, Z. Z. (2013). Importance analysis for models with correlated variables and its sparse grid solution.
30. He, X. Q. (2008).
31. Song, S. F., Wang, L. (2017). Modified GMDH-NN algorithm and its application for global sensitivity analysis.
32. He, R. Y. (2020).