Machine Learning (ML) has drastically changed clinical diagnostic procedures. In Cardiovascular Disease (CVD) diagnosis in particular, ML is indispensable for reducing human error. Numerous studies have focused on disease prediction, but because prediction depends on multiple parameters, further investigation is required to upgrade clinical procedures. Multi-layered ML, also called Deep Learning (DL), has opened new horizons in the field of clinical diagnostics. DL achieves reliable accuracy with big datasets, but the reverse is the case with small datasets. This paper proposes a novel method that deals with the issue of small, low-dimensional datasets. Inspired by regression analysis, the proposed method classifies the data in three stages. In the first stage, the feature representation is converted into probabilities using multiple regression techniques; the second stage learns from the probability outputs of the previous stage; and the third stage produces the final classification. Extensive experiments were carried out on the Cleveland heart disease dataset, and the results show a significant improvement in classification accuracy. The comparative results of the paper make it evident that the prevailing statistical ML methods have stagnated and will no longer be the disease prediction techniques in demand in the future.

Nowadays, the use of the angiography method is not uncommon for Coronary Artery Disease (CAD) diagnosis [

With the wide use of Artificial Intelligence in machines and computers, researchers aim to build intelligent systems [

Despite many kinds of research conducted regarding the prediction of heart diseases, there is still room for improvement [

This paper proposes a novel and robust method for the accurate classification of datasets having a small number of samples and features. Extensive experiments were performed on the Cleveland heart disease dataset. The results of the experiments show the superiority of the method.

This paper identifies and investigates possible enhancements to the proposed model. Moreover, this paper explores the outcomes of these possible enhancements along with a comparative study of different combinations of layers.

The remainder of the paper is organized as follows. Section 2 gives an overview of previous work. The proposed model and its components are explained in Section 3, followed by the description of the dataset, evaluation metrics, experimental results, and comparisons in Section 4. The conclusion is drawn in Section 5.

In the last few years, different heart disease diagnostic methods have been proposed [

Ali et al. [

Šter et al. [

Previous researchers such as Kumar [

Ali et al. [

Not many researchers based their work on DL [

In this section, the details about the dataset are provided followed by an overview of the proposed method. Finally, the proposed architecture is explained in detail.

To validate the proposed method, we employed the renowned Cleveland heart disease dataset. This dataset was collected by Robert Detrano (MD) and was obtained from the V.A. Medical Center, Long Beach, and the Cleveland Clinic Foundation. It is available in the University of California, Irvine (UCI) ML repository. The dataset contains 303 instances, of which 297 contain no missing values and six have some missing attribute values. The dataset has 76 raw features per instance, but most previous studies used only 13 of them for heart disease prediction. We also use these 13 features in our experiments.
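As an illustration, the 13 commonly used features can be loaded roughly as follows. This is a sketch under assumptions: it presumes a local copy of the UCI repository's `processed.cleveland.data` file (where missing values are marked with `?`), and the use of pandas is our choice, not part of the original study.

```python
import pandas as pd

# The 13 features used in most Cleveland heart disease studies, plus the target.
FEATURES = ["AGE", "SEX", "CPT", "RBP", "SCH", "FBS", "RES",
            "MHR", "EIA", "OPK", "PES", "VCA", "THA"]
COLUMNS = FEATURES + ["TARGET"]

def load_cleveland(path="processed.cleveland.data"):
    """Load a local copy of the processed Cleveland file (assumed path)."""
    df = pd.read_csv(path, header=None, names=COLUMNS, na_values="?")
    df = df.dropna()                                   # drop instances with missing values
    df["TARGET"] = (df["TARGET"] > 0).astype(int)      # 0 = healthy, 1-4 collapsed to 1 = disease
    return df[FEATURES], df["TARGET"]
```

Collapsing the target values 1–4 into a single positive class follows the common binary formulation of the task.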

Feature | Description | Possible values |
---|---|---|
AGE | Age in years | 30 < age < 77 |
SEX | Patient’s gender | male = 1, female = 0 |
CPT | Chest pain type | typical angina = 1, atypical angina = 2, non-anginal pain = 3, asymptomatic = 4 |
RBP | Resting blood pressure | 94–200 mm Hg |
SCH | Serum cholesterol | 120–564 mg/dl |
FBS | Fasting blood sugar | > 120 mg/dl = 1, otherwise = 0 |
RES | Resting electrocardiographic results | normal = 0, ST-T abnormality = 1, hypertrophy = 2 |
MHR | Maximum heart rate achieved | 71–202 |
EIA | Exercise induced angina | yes = 1, no = 0 |
OPK | Old peak | 0–6.2 |
PES | Peak exercise slope | up sloping = 1, flat = 2, down sloping = 3 |
VCA | Number of major vessels colored by fluoroscopy | 0, 1, 2, 3 |
THA | Thallium scan | normal = 3, fixed defect = 6, reversible defect = 7 |

An overview of the regression layers used in stage one is provided in this section. To provide a comprehensive comparative study, we incorporate ANN and SVM as stage two.

SVM classification is performed by finding a hyperplane that maximizes the margin between the two categories. The maximum-margin formulation can be transformed into a convex programming problem. The vectors that define the hyperplane are called support vectors. Due to their high classification performance, support vector machines have been used in many studies related to classification. In a binary classification problem, the two classes are separated by the hyperplane A^T x + b = 0, where A is the coefficient vector normal to the surface of the hyperplane, b is the offset from the origin, and x is an input sample from the dataset. Training an SVM yields A and b. In the linear case, A can be solved for conveniently using Lagrange multipliers. The data points lying on the margin boundaries are the support vectors. For details, see [

A positive semi-definite function that satisfies the Mercer condition can be used as a kernel function. For example, the polynomial kernel is given by K(x_i, x_j) = (x_i^T x_j + c)^d, where c ≥ 0 is a constant and d is the degree of the polynomial.
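A minimal numeric sketch of this kernel (the constant c and degree d are illustrative choices, not values from the paper) shows that it produces a symmetric, positive semi-definite Gram matrix, as the Mercer condition requires:

```python
import numpy as np

def polynomial_kernel(X, Z, c=1.0, d=3):
    """Polynomial kernel K(x, z) = (x.z + c)^d between row-sample matrices X and Z."""
    return (X @ Z.T + c) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 13))          # 20 samples with 13 features, as in the dataset
K = polynomial_kernel(X, X)

# Mercer condition in practice: the Gram matrix is symmetric positive semi-definite.
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() > -1e-6
```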

For binary classification problems, logistic regression models the hypothesis as h_Θ(x) = g(Θ^T x) and classifies according to a threshold of 0.5 on h_Θ(x). Any value h_Θ(x) > 0.5 is predicted as y = 1 (diseased), whereas h_Θ(x) < 0.5 is predicted as y = 0 and the person is considered healthy. The hypothesis is therefore bounded as 0 ≤ h_Θ(x) ≤ 1. The logistic (sigmoid) function can be represented by g(z) = 1 / (1 + e^(−z)).

Similarly, the cost function for logistic regression can be expressed as the cross entropy J(Θ) = −(1/m) Σ [y log h_Θ(x) + (1 − y) log(1 − h_Θ(x))] over the m training samples.
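The sigmoid hypothesis, the 0.5 threshold, and the cross-entropy cost can be sketched in a few lines (a minimal illustration, not the configuration used in the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Cross-entropy cost J = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = sigmoid(X @ theta)
    eps = 1e-12                              # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

# Thresholding at 0.5: h > 0.5 -> predict disease (1), otherwise healthy (0).
X = np.array([[1.0, 2.0], [1.0, -2.0]])      # first column is the intercept term
theta = np.array([0.0, 1.0])
preds = (sigmoid(X @ theta) > 0.5).astype(int)
```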

This method resembles least-squares fitting of a linear model. The split is chosen to minimize the sum of squared errors between the observed values and the mean value of each node; alternatively, the least-absolute-deviation criterion minimizes the mean absolute deviation from the median of the node. The advantage of the least-squares criterion is that it yields a more stable model, whereas least absolute deviation is less susceptible to outliers. Suppose that we have a scalar outcome Y and a p-vector of variables X; then we assume that Y = f(X) + ε, where f is approximated by a constant within each node.
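The node-splitting rule described above can be sketched as follows, comparing the least-squares criterion (deviation from the node mean) with the least-absolute-deviation criterion (deviation from the node median). The single-feature setting and the toy data are illustrative simplifications:

```python
import numpy as np

def split_cost(y_left, y_right, criterion="sse"):
    """Cost of a candidate split under least squares or least absolute deviation."""
    if criterion == "sse":   # sum of squared errors around each node's mean
        return (((y_left - y_left.mean()) ** 2).sum()
                + ((y_right - y_right.mean()) ** 2).sum())
    # absolute deviation around each node's median
    return (np.abs(y_left - np.median(y_left)).sum()
            + np.abs(y_right - np.median(y_right)).sum())

def best_split(x, y, criterion="sse"):
    """Exhaustively search split thresholds on a single feature x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = (None, np.inf)
    for i in range(1, len(x)):
        cost = split_cost(y[:i], y[i:], criterion)
        if cost < best[1]:
            best = (float((x[i - 1] + x[i]) / 2), cost)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.1, 0.0, 1.0, 0.9, 1.1])
threshold, cost = best_split(x, y)       # the best split separates the two clusters
```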

Linear regression predicts an estimate y* of y. The output is assumed to be a linear function of the input with added noise: y_t = x_tᵀw + ε_t.

The noise term ε_t follows a normal distribution, that is, ε_t ~ N(0, σ_ε²), so that y_t = x_tᵀw + ε_t; stacking the observations x_t into a matrix X and the targets into a vector y gives y = Xw + ε.

Entering the Bayesian framework, we perform this task via the posterior distribution over the weights, using a Gaussian prior. If we place a Gaussian prior on the weights, p(w) = N(0, Σ), and use the Gaussian likelihood p(y | X, w) = N(Xw, σ_ε²I), then the posterior distribution is p(w | X, y) = N(μ_post, Σ_post), where Σ_post = (Σ⁻¹ + σ_ε⁻²XᵀX)⁻¹ and μ_post = σ_ε⁻²Σ_post Xᵀy.
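Under these Gaussian assumptions the posterior over the weights is available in closed form, Σ_post = (Σ⁻¹ + σ_ε⁻²XᵀX)⁻¹ and μ_post = σ_ε⁻²Σ_post Xᵀy. A small numeric check (the prior covariance, noise level, and synthetic data are illustrative choices):

```python
import numpy as np

def bayesian_linear_posterior(X, y, prior_cov, noise_var):
    """Posterior mean and covariance of w for p(w) = N(0, prior_cov), y = Xw + eps."""
    post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + (X.T @ X) / noise_var)
    post_mean = post_cov @ X.T @ y / noise_var
    return post_mean, post_cov

rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = X @ w_true + rng.normal(scale=0.1, size=500)

mean, cov = bayesian_linear_posterior(X, y, prior_cov=np.eye(2), noise_var=0.01)
# With plenty of data, the posterior mean approaches the true weights.
```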

In the Gaussian process [

Note that this is the same as the linear regression assumption, because the observations are modeled as a Gaussian signal term plus independent Gaussian noise.

A Gaussian process is fully specified by a mean function m(x) and a covariance (kernel) function k(x, x′), so that f(x) ~ GP(m(x), k(x, x′)).
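A minimal GP regression sketch with a rational quadratic covariance, matching the GPR layer used later: the predictive mean at test inputs X* is K(X*, X)(K(X, X) + σ_ε²I)⁻¹y. The kernel hyperparameters, noise level, and sine-curve data below are illustrative assumptions:

```python
import numpy as np

def rational_quadratic(A, B, alpha=1.0, length=0.2):
    """Rational quadratic kernel k(a, b) = (1 + |a - b|^2 / (2*alpha*length^2))^(-alpha)."""
    sq = (A[:, None] - B[None, :]) ** 2
    return (1.0 + sq / (2.0 * alpha * length ** 2)) ** (-alpha)

def gp_predict(x_train, y_train, x_test, noise_var=1e-6):
    """Predictive mean: K(x*, x) (K(x, x) + noise*I)^-1 y."""
    K = rational_quadratic(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_star = rational_quadratic(x_test, x_train)
    return K_star @ np.linalg.solve(K, y_train)

x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x)
mean = gp_predict(x, y, x)      # with tiny noise, predictions at training inputs match y
```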

The proposed method uses ANN and SVM as the second stage as we intend to provide a comparative study of both robust models [

where k is the class index. The DNN can be trained by back-propagating the derivatives of a cost function that measures the difference between the target outputs and the actual outputs produced for each training case. With the softmax output function, the natural cost function C is the cross entropy between the target probabilities d and the softmax outputs p, which can be expressed as C = −Σ_k d_k log p_k.

The target probabilities (typically 1 or 0) are the supervision information provided to train the DNN classifier.
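In code, the softmax output and the cross-entropy cost C = −Σ_k d_k log p_k look like this (a minimal sketch with illustrative logits):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())                  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(d, p):
    """C = -sum_k d_k * log p_k for target distribution d and softmax output p."""
    return -np.sum(d * np.log(p + 1e-12))

logits = np.array([2.0, 0.5, -1.0])
p = softmax(logits)                          # a valid probability distribution
d = np.array([1.0, 0.0, 0.0])                # one-hot target: the true class is class 0
loss = cross_entropy(d, p)                   # equals -log p[0] for a one-hot target
```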

The proposed method relies on the probabilities calculated by the regression techniques. Researchers use many basic statistical ML methods for the prediction of heart disease, but the improvement in accuracy has remained stagnant at a single point for a long time. Previous research shows that basic statistical ML methods have already reached their limits and that it is not possible to improve their performance conventionally. The proposed method utilizes several basic statistical ML methods; some act as strong classifiers and some as weak classifiers (also referred to as strong and weak learners). The proposed method is robust in predicting heart disease, and this robustness is achieved by combining several weak and strong learners.

The method is inspired by DL [

The proposed method works in three stages. The first stage includes several regression layers, and more regression layers can be added to it. Each regression layer is trained using the training data with the actual labels. These layers may employ any type of regression model with a different configuration of hyperparameters. For example, suppose a statistical regression technique has a single tuning hyperparameter σ whose values lie between 0 and 1; then each layer can be a regression model with a different configuration, i.e., a different value of σ. All the probabilities obtained by the first stage are flattened and then forwarded to the second stage: an input heart disease sample is first converted into a vector of probabilities, which is then passed on. The second stage is another regression layer that learns from all the probabilities yielded by the first-stage regression layers and the actual labels, and outputs the probability of heart disease being present. This probability is forwarded to the third and last stage, which makes the prediction according to the SoftMax threshold level [
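The three stages can be sketched with off-the-shelf components as follows. This is an illustrative reconstruction, not the authors' exact configuration: stage one uses a few regression learners with different hyperparameters, stage two is a classifier trained on the stacked stage-one outputs, and stage three thresholds the resulting probability. The synthetic data stands in for the Cleveland samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR, SVC
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the 13-feature Cleveland data.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=90, random_state=0)

# Stage one: regression layers, each trained on the raw features and true labels.
stage_one = [
    DecisionTreeRegressor(max_depth=4, random_state=0),   # tree layer
    SVR(kernel="linear"),                                 # linear SVM layer
    SVR(kernel="rbf", gamma=0.05),                        # Gaussian SVM layer
    LinearRegression(),                                   # linear regression layer
]
for layer in stage_one:
    layer.fit(X_tr, y_tr)

def stage_one_probs(X):
    """Flatten each layer's output into one probability-like vector per sample."""
    return np.column_stack([layer.predict(X) for layer in stage_one])

# Stage two: a classifier trained on the stacked stage-one outputs.
stage_two = SVC(probability=True, random_state=0).fit(stage_one_probs(X_tr), y_tr)

# Stage three: threshold the class probability at 0.5 to obtain the prediction.
probs = stage_two.predict_proba(stage_one_probs(X_te))[:, 1]
preds = (probs > 0.5).astype(int)
accuracy = (preds == y_te).mean()
```

Swapping `SVC` for a small neural network classifier in stage two gives the ANN variant discussed below.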

As there are many possible combinations for the proposed method, the stage-one layers include statistical regression learners, namely a simple tree, a medium tree, linear SVM, quadratic SVM, cubic SVM, fine Gaussian SVM, medium Gaussian SVM, coarse Gaussian SVM, rational quadratic Gaussian Process Regression (GPR), and robust linear regression, referred to as L1, L2, L3, L4, L5, L6, L7, L8, L9, and L10 respectively. Further details about the regression layers and the individual performance of each layer can be found in the following table.

Layer identifier | Method | Type | Features | RMSE | TA (%) |
---|---|---|---|---|---|
L1 | Tree regression | Simple tree | 13 | 0.42706 | 71.0 |
L2 | Tree regression | Medium tree | 13 | 0.39141 | 72.6 |
L3 | SVM regression | Linear SVM | 13 | 0.37282 | 83.2 |
L4 | SVM regression | Quadratic SVM | 13 | 0.37941 | 82.2 |
L5 | SVM regression | Cubic SVM | 13 | 0.42117 | 78.2 |
L6 | SVM regression | Fine Gaussian SVM | 13 | 0.48367 | 56.4 |
L7 | SVM regression | Medium Gaussian SVM | 13 | 0.36705 | 82.5 |
L8 | SVM regression | Coarse Gaussian SVM | 13 | 0.36636 | 82.2 |
L9 | Gaussian process regression | Rational quadratic GPR | 13 | 0.35813 | 83.2 |
L10 | Linear regression | Robust linear | 13 | 0.36966 | 77.2 |

This paper presents two possible solutions for stage two. The first utilizes an ANN as the stage-two probability learner; the second utilizes an SVM. The reasons for using an ANN are its flexible learning, its ability to work with small amounts of data, its capacity to capture complex nonlinear relationships between dependent and independent variables, and its predictive ability. The reason for using an SVM is that it performs well at establishing a support-vector hyperplane that distinguishes between the classes.

We perform comprehensive experiments on the Cleveland dataset. We use the features in their original form for training and testing and ignore the samples with missing attributes. We split the dataset into random sets of 213 training and 90 testing samples and create three such groups to validate the generalizability of the proposed technique. Each regression layer of the first stage is trained on the training samples with the actual labels. In both cases, the second-stage ANN or SVM is trained on the probabilities obtained from the first stage using the actual labels. For training the ANN, the data division parameter is set to random, scaled conjugate gradient is used during training, and cross-entropy serves as the performance measure.

The model is assessed based on accuracy, sensitivity, specificity, and the Matthews Correlation Coefficient (MCC). Accuracy is the percentage of correctly classified items in the test dataset. Sensitivity measures the correct classification of diseased subjects, while specificity measures the correct classification of healthy subjects. These metrics are defined as Accuracy = (TP + TN) / (TP + TN + FP + FN), Sensitivity = TP / (TP + FN), and Specificity = TN / (TN + FP) [

Here, FP, TP, FN, and TN denote the number of false positives, true positives, false negatives, and true negatives respectively.

MCC is generally employed for the statistical analysis of binary classification. It measures the quality of a classification model and returns a value between −1 and 1, where 1 indicates perfect prediction and −1 the worst possible prediction. MCC can be represented as MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)).
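The four metrics can be computed directly from the confusion counts. A minimal sketch (the example counts are illustrative, not results from the paper):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and MCC from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # correct classification of diseased subjects
    specificity = tn / (tn + fp)          # correct classification of healthy subjects
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return accuracy, sensitivity, specificity, mcc

# Illustrative counts for a 90-sample test set:
acc, sens, spec, mcc = classification_metrics(tp=48, tn=38, fp=2, fn=2)
```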

The Receiver Operating Characteristic (ROC) curve describes the sensitivity of the classifier in terms of the true positive and false positive rates. The larger the Area Under the Curve (AUC), the better the model. The curve is based on the predicted outcomes and the true outcomes.

The AUCs for the test datasets were calculated and used to compare the discriminative powers of the models.
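AUC can be computed without plotting the curve via the rank-sum (Mann–Whitney) formulation: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A small sketch with illustrative scores:

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the probability a positive outranks a negative (ties count as 0.5)."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

y = np.array([0, 0, 1, 1, 1])
s = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
auc = auc_score(y, s)          # one positive (0.35) is outranked by a negative (0.4)
```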

The proposed method relies on several weak learners.

The proposed system is flexible: the number, order, and type of layers in the first stage can be changed as needed. We experimented to study the effects of altering the sequence and number of first-stage layers. The experiments revealed that, in our case, the performance of the proposed method with both SVM and ANN degraded drastically when the number of layers was reduced. We tested the proposed system using between 7 and 10 layers for both models. The detailed results are shown in the following table.

Method | No. of layers | Layers | Accuracy | Sensitivity | Specificity | MCC |
---|---|---|---|---|---|---|
Regression layers + SVM | 7 | L1~L7 | 0.7222 | 0.7115 | 0.7368 | 0.4434 |
Regression layers + SVM | 8 | L1~L8 | 0.7444 | 0.8158 | 0.6923 | 0.5024 |
Regression layers + SVM | 9 | L1~L9 | 0.8444 | 0.9024 | 0.7959 | 0.6963 |
Regression layers + SVM | 10 | L1~L10 | 0.9555 | 0.9230 | 1.0 | 0.9138 |
Regression layers + ANN | 7 | L1~L7 | 0.6444 | 0.7027 | 0.6038 | 0.3019 |
Regression layers + ANN | 8 | L1~L8 | 0.7333 | 0.4970 | 0.8485 | 0.6667 |
Regression layers + ANN | 9 | L1~L9 | 0.8375 | 0.6740 | 0.8158 | 0.8571 |
Regression layers + ANN | 10 | L1~L10 | 0.9444 | 0.9038 | 0.9036 | 0.8927 |

In this section, the proposed model is compared with previous methods. A detailed performance comparison is presented in the following table.

Authors | Method | Validation | Accuracy | Sensitivity | Specificity |
---|---|---|---|---|---|
Cheung | Naïve Bayes | – | 81.48 | 0.8418 | 0.7610 |
Ster and Dobnikar | Linear discriminant analysis | – | 84.50 | 0.8410 | 0.8291 |
Polat et al. | AIRS | – | 84.50 | 0.6287 | 0.7630 |
Ozsen et al. | Kernel function with AIS | K-fold | 85.93 | 0.7413 | 0.8163 |
Polat et al. | Fuzzy-AIS-KNN based system | K-fold | 87.00 | 0.7860 | 0.8226 |
Kahramanli and Allahverdi | Hybrid neural network system | K-fold | 86.80 | 0.7901 | 0.8197 |
Resul et al. | Neural network ensembles | Holdout | 89.01 | 0.8017 | 0.7738 |
Jankowski and Kadirkamanathan | IncNet | Holdout | 90.00 | 0.8815 | 0.8270 |
Kumar | ANFIS | Holdout | 91.18 | 0.7620 | 0.7851 |
Samuel et al. | ANN-F-AHP | Holdout | 91.10 | 0.8233 | 0.8340 |
Kumar | Fuzzy resolution mechanism | Holdout | 91.93 | 0.8950 | 0.8701 |
Paul et al. | Adaptive weighted fuzzy system ensemble | Holdout | 92.31 | 0.8821 | 0.8682 |
Ali et al. | Multiple SVM | K-fold | 92.22 | 0.8739 | 0.8196 |
Ali et al. | Statistical model + ANN | Holdout | 93.33 | 0.8973 | 0.8043 |

Over the past decades, ML has gained considerable importance in the field of clinical medicine and disease prediction. For complex problems and bigger datasets, researchers are now focusing more on DL-based approaches. Although DL approaches achieve high performance, for datasets with a small number of features and training samples DL is not feasible, and in some cases not applicable at all. In this paper, we proposed a method, inspired by regression analysis, that focuses especially on small but complex problems. The proposed method converts the representation of the data into probabilities. The model consists of three main stages: the first stage converts the data into probabilities, the second stage can utilize SVM regression or an ANN, and the third stage consists of a SoftMax and prediction section. We performed extensive experiments to validate the proposed method using heart disease detection as a case study on the Cleveland dataset. The results show that the performance of basic statistical ML algorithms has stagnated, whereas the proposed method obtains higher accuracy. The proposed method's application is not limited to heart disease detection.

A limitation of the proposed method is the difficulty of generalizing these findings on heart disease due to the small sample size. In future work, we plan to apply the method to larger datasets and to analyze other diseases with different feature selection techniques.

The authors received no specific funding for this study.

All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

The authors declare that they have no conflicts of interest to report regarding the present study.