Breast cancer is one of the leading forms of cancer affecting women in developed and developing countries alike. As a consequence of growing life expectancy, increasing urbanization and the adoption of Western lifestyles, a high prevalence of this cancer is noted in the developed world. This paper aims to develop a novel model that diagnoses breast cancer by using heterogeneous datasets. The model can work as a strong decision support system that helps doctors make the right decision when diagnosing breast cancer patients. The proposed model uses three datasets to develop three sub-models. Each sub-model works independently. The final diagnosis decision is taken by the three sub-models independently. The power of the model comes from checking patients from diverse perspectives, which reduces the risk of misdiagnosis. The model has been developed by conducting intensive experiments. Several classification algorithms were compared to select the best one for each sub-model. In the final results, the sub-model accuracies were 72%, 74% and 97%.

Breast cancer is one of the leading forms of cancer affecting women in developed and developing countries alike. As a consequence of growing life expectancy, increasing urbanization and the adoption of Western lifestyles, a high prevalence of this cancer is noted in the developed world [

This paper aims to develop a novel model that diagnoses breast cancer by using heterogeneous datasets. The model can work as a strong decision support system that helps doctors make the right decision when diagnosing breast cancer patients. The proposed model uses three datasets to develop three sub-models. Each sub-model works independently. The final diagnosis decision is taken by the three sub-models independently. The power of the model comes from checking patients from diverse perspectives, which reduces the risk of misdiagnosis. The model has been developed by conducting intensive experiments. Several classification algorithms were compared to select the best one for each sub-model. Most published models are based on only one dataset, which makes their diagnoses less accurate. The obtained results have been evaluated and discussed in detail.

The rest of the paper is organized into four sections. Section two presents the SVM and Naïve Bayes algorithms, which were used to develop the model. Section three presents related research. Section four describes the research model in terms of framework, datasets, implementation and discussion. Finally, section five concludes the paper with some recommendations for future work to enhance the model.

Choosing an appropriate data mining algorithm plays a crucial role in developing an efficient data mining model. Data mining algorithms are classified into two groups: supervised and unsupervised. This research focuses on the supervised group to classify breast cancer patients. The following algorithms have been used in this research to develop our proposed model:

The Support Vector Machine (SVM) is a supervised machine learning algorithm. It can be used for either regression or classification tasks [

1. Keep ||w|| = 1 and maximize g(x), or

2. Require g(x) ≥ 1 and minimize ||w||.

Approach 2 is used, and the problem is stated as follows:

Subject to

Subject to

According to the Lagrange multiplier method, J is minimized with respect to w and b, but it must be maximized with respect to the multipliers α.

So we can write J as:

Q(α) represents the dual form of J, which depends only on α because all the remaining terms are known scalars.
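For reference, the derivation outlined above matches the standard hard-margin SVM formulation, which can be written as follows. This is a sketch of the textbook form, not necessarily the exact notation of the original equations: the training pairs (x_i, y_i) with labels y_i in {-1, +1}, the number of samples N, and the decision function g(x) = w^T x + b are assumed symbols.

```latex
% Primal problem (approach 2: fix the margin constraint, minimize the norm of w)
\min_{w,\,b} \; \tfrac{1}{2}\, w^{\top} w
\quad \text{subject to} \quad y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, N

% Lagrangian with multipliers \alpha_i
J(w, b, \alpha) = \tfrac{1}{2}\, w^{\top} w
  - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( w^{\top} x_i + b \right) - 1 \right],
\qquad \alpha_i \ge 0

% Stationarity conditions with respect to w and b
\frac{\partial J}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i,
\qquad
\frac{\partial J}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0

% Dual problem obtained by substituting the conditions back into J
\max_{\alpha} \; Q(\alpha) = \sum_{i=1}^{N} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, x_i^{\top} x_j
\quad \text{subject to} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \;\; \alpha_i \ge 0
```

In this form the training data enter Q(α) only through the inner products x_i^T x_j, which is why the dual depends on α alone.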

The Naïve Bayes classifier is based on Bayes’ theorem of probability [

Wi is the category to which the element belongs; the element is classified based on its attribute values.

Naïve Bayes assumes that the attributes are independent of one another given the category: no attribute affects any other, so the probability of observing a combination of attribute values within a category factorizes into a product over the individual attributes.
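As an illustration, the independence assumption can be written in the usual textbook form shown below. The attribute vector x = (x_1, ..., x_n) and the index j are assumed symbols introduced here; only the category symbol W_i comes from the text.

```latex
% Bayes' theorem for category W_i given the attribute vector x
P(W_i \mid x) = \frac{P(x \mid W_i)\, P(W_i)}{P(x)}

% Naive independence assumption: attributes are conditionally independent given the category
P(x \mid W_i) = \prod_{j=1}^{n} P(x_j \mid W_i)

% Resulting decision rule (P(x) is the same for every category and can be dropped)
\hat{W} = \arg\max_{i} \; P(W_i) \prod_{j=1}^{n} P(x_j \mid W_i)
```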

As mentioned above, breast cancer is considered one of the most serious diseases that cause death among women. Researchers have conducted many studies in this area, dealing with breast cancer in terms of both diagnosis and treatment. This section presents some of these studies:

Etehadtavakol et al. [

The accuracy of the model after implementation was 89.6%. Mohanty et al. [

Two algorithms were used to build a model proposed by Mousa et al. [

By using the feature selection method “INTERACT”, Shen et al. proposed a classification model to diagnose breast cancer. The accuracy of the model improved when the feature selection method was used. The model was built on 9 selected factors that are relevant to a breast cancer diagnosis. The selection process aimed to improve the quality of the model [

By using thermography, breast cancer could be detected in early stages. Mookiah et al. [

A reduced set of discriminatory characteristics from curvelet transform for the diagnosis of breast cancer was addressed by Dhahbi et al. [

The conventional diagnosis of breast cancer also raises challenges such as poor accuracy and limited self-adaptability. To address these problems, an AdaBoost-SVM classification algorithm combined with k-means has been proposed for early breast cancer diagnosis. The usefulness of the suggested approach was tested by measuring its precision and its confusion matrix, which provides doctors with valuable hints for the detection of early breast cancer [

In this paper, a novel model is introduced to diagnose breast cancer effectively based on heterogeneous sub-models. The main idea behind this model is to use heterogeneous datasets to develop multiple classification sub-models. Each sub-model participates in diagnosing breast cancer patients in a different way. As mentioned above, breast cancer is a very dangerous disease, so diagnostic accuracy is a critical issue that directly affects patients’ lives. To develop an effective diagnostic model, we use heterogeneous medical record datasets. Each dataset consists of multiple features that represent patients’ medical records in a different way. Unlike current breast cancer classification models, the proposed model can give more accurate results because it is based on heterogeneous datasets. The research model uses three datasets to develop three sub-models. Each sub-model participates in diagnosing the patients by using a weighting method. The weight represents the share of each sub-model in making the diagnosis decision.
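The text does not specify how the weights are chosen or how the sub-model outputs are combined, so the following is only a minimal sketch of one possible weighting scheme. The function name `weighted_diagnosis`, the 0.5 threshold, the example probabilities, and the choice of using each sub-model's reported accuracy as its weight are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def weighted_diagnosis(
    probabilities: Sequence[float],
    weights: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Combine sub-model outputs into one score (hypothetical scheme).

    probabilities: each sub-model's estimated probability of a positive diagnosis.
    weights: the share of each sub-model in the final decision.
    Returns the weighted score and whether it reaches the threshold.
    """
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per sub-model")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted average of the sub-model probabilities.
    score = sum(p * w for p, w in zip(probabilities, weights)) / total_weight
    return score, score >= threshold


# Assumption: weights taken from the reported sub-model accuracies.
example_weights = [0.72, 0.74, 0.97]
# Hypothetical sub-model outputs for a single patient.
example_probabilities = [0.2, 0.6, 0.9]

score, positive = weighted_diagnosis(example_probabilities, example_weights)
print(f"weighted score = {score:.3f}, positive diagnosis = {positive}")
```

Because the three sub-models are trained on different feature sets, a scheme like this would require the corresponding measurements to be available for the same patient.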

The research model uses three heterogeneous datasets. Each one is used to develop a sub-model. In this section, the datasets are described as follows:

Dataset 1 consists of 286 instances and 9 attributes plus the class; some attributes are nominal and some are linear. It was created by Matjaz Zwitter and Milan Soklic (physicians) at the Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia [

Attribute with description |
---|
1. Class (no-recurrence-events, recurrence-events) |
2. Age |
3. Menopause (lt40, ge40, premeno) |
4. Tumor-size |
5. Inv-nodes |
6. Node-caps (yes, no) |
7. Deg-malig (1, 2, 3) |
8. Breast (left, right) |
9. Breast-quad (left-up, left-low, right-up, right-low, central) |
10. Irradiat (yes, no) |

Dataset 2 contains 116 instances and 10 attributes: nine quantitative predictors and a binary dependent variable that indicates the presence or absence of breast cancer. The predictors are anthropometric data and parameters that can be obtained from routine blood analysis. Prediction models based on these predictors can potentially be used as a biomarker of breast cancer if they are accurate [

Attributes |
---|
Age (years) |
BMI (kg/m²) |
Glucose (mg/dL) |
Insulin (μU/mL) |
HOMA |
Leptin (ng/mL) |
Adiponectin (μg/mL) |
Resistin (ng/mL) |
MCP-1 (pg/dL) |
Class (1 = Healthy controls, 2 = Patients) |

In Dataset 3, the features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image. The three-dimensional space is that described in: [

Features |
---|
Radius_worst |
Perimeter_worst |
Area_worst |
Concave points_worst |
Concave points_mean |
Perimeter_mean |
Area_mean |
Radius_mean |
Concavity_mean |
Concavity_worst |
Area_se |
Perimeter_se |
Compactness_mean |
Compactness_worst |
Radius_se |
Texture_worst |
Concave points_se |
Texture_mean |
Concavity_se |
Smoothness_worst |
Symmetry_worst |
Compactness_se |
Smoothness_mean |
Symmetry_mean |
Fractal_dimension_worst |
Fractal_dimension_se |
Symmetry_se |
Smoothness_se |
Texture_se |
Fractal_dimension_mean |
Diagnosis (B or M) |

# | Dataset | No. of Instances | Class #1 | Class #2 |
---|---|---|---|---|
1 | Dataset 1 | 286 | 85 | 201 |
2 | Dataset 2 | 116 | 64 | 52 |
3 | Dataset 3 | 569 | 212 | 357 |
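The class counts in the table above give a useful reference point for reading the accuracy figures reported later: a classifier that always predicts the larger class already reaches the majority-class share. The short sketch below computes that baseline from the table's counts; the names `class_counts` and `majority_baseline` are illustrative and not part of the original work.

```python
# Class counts (Class #1, Class #2) copied from the dataset summary table.
class_counts = {
    "Dataset 1": (85, 201),
    "Dataset 2": (64, 52),
    "Dataset 3": (212, 357),
}


def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts) / sum(counts)


baselines = {name: majority_baseline(counts) for name, counts in class_counts.items()}

for name, counts in class_counts.items():
    print(f"{name}: {sum(counts)} instances, majority-class baseline = {baselines[name]:.3f}")
```

With these counts the baseline for Dataset 1 is roughly 0.70, so an accuracy in the low 0.70s on that dataset is only a modest improvement over always predicting the larger class.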

The proposed model is based on three sub-models. Each one is developed by using a dataset and a classification algorithm. As mentioned above, Dataset 1, Dataset 2 and Dataset 3 are used to train and test the sub-models. To select the best algorithm for each sub-model, intensive experiments have been conducted with five classification algorithms: SVM, Random Forest, Neural Network, Naïve Bayes and Logistic Regression. One of these algorithms is chosen for each sub-model independently, according to its evaluation results. Each sub-model contributes to the final decision on the breast cancer diagnosis.

This section presents the proposed model implementation based on the framework in

Sub-model 1 was developed using Dataset 1 and five algorithms: SVM, Random Forest, Neural Network, Naïve Bayes and Logistic Regression. Intensive experiments were conducted to select the best algorithm. The following table reports the results for each algorithm:

Model | AUC | CA | F1 | Precision | Recall |
---|---|---|---|---|---|
SVM | 0.666111 | 0.695804 | 0.63797 | 0.644655 | 0.695804 |
Random Forest | 0.700029 | 0.674825 | 0.65436 | 0.645955 | 0.674825 |
Neural Network | 0.681592 | 0.702797 | 0.696475 | 0.692275 | 0.702797 |
Naive Bayes | 0.711853 | 0.723776 | 0.721288 | 0.719255 | 0.723776 |
Logistic Regression | 0.664501 | 0.695804 | 0.668553 | 0.663799 | 0.695804 |
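In the table above, the Recall column equals the CA (classification accuracy) column in every row. That is what one would expect if precision, recall and F1 are averaged over the classes with weights proportional to class size, although the source does not state which averaging was used. The sketch below computes the metrics under that assumption; the function name `weighted_metrics` and the confusion matrix `example_cm` are hypothetical and not taken from the paper.

```python
from typing import Dict, List


def weighted_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy plus class-weighted precision, recall and F1.

    cm[i][j] is the number of instances of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for k in range(n_classes):
        true_positive = cm[k][k]
        support = sum(cm[k])                                   # instances of class k
        predicted = sum(cm[i][k] for i in range(n_classes))    # predictions of class k
        p_k = true_positive / predicted if predicted else 0.0
        r_k = true_positive / support if support else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support / total                               # class share
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical confusion matrix for a 286-instance, two-class problem.
example_cm = [[40, 45], [30, 171]]
metrics = weighted_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Under class-size weighting, recall reduces to the sum of correctly classified instances divided by the total, which is the accuracy itself; precision and F1 generally differ from it.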

Sub-model 2 was developed using Dataset 2 and the same five algorithms: SVM, Random Forest, Neural Network, Naïve Bayes and Logistic Regression. Intensive experiments were conducted to select the best algorithm. The following table reports the results for each algorithm:

Model | AUC | CA | F1 | Precision | Recall |
---|---|---|---|---|---|
SVM | 0.82121394 | 0.74137931 | 0.74137931 | 0.74137931 | 0.74137931 |
Random Forest | 0.74519231 | 0.62068966 | 0.62068966 | 0.62068966 | 0.62068966 |
Neural Network | 0.8061899 | 0.70689655 | 0.70627763 | 0.7060815 | 0.70689655 |
Naive Bayes | 0.75510817 | 0.69827586 | 0.69906213 | 0.70493299 | 0.69827586 |
Logistic Regression | 0.76502404 | 0.71551724 | 0.7152389 | 0.71506735 | 0.71551724 |

Sub-model 3 was developed using Dataset 3 and the same five algorithms: SVM, Random Forest, Neural Network, Naïve Bayes and Logistic Regression. Intensive experiments were conducted to select the best algorithm. The following table reports the results for each algorithm:

Model | AUC | CA | F1 | Precision | Recall |
---|---|---|---|---|---|
SVM | 0.99467523 | 0.97363796 | 0.97362524 | 0.97361893 | 0.97363796 |
Random Forest | 0.98923154 | 0.9543058 | 0.95416824 | 0.9542587 | 0.9543058 |
Neural Network | 0.99151736 | 0.97188049 | 0.97185313 | 0.97185145 | 0.97188049 |
Naive Bayes | 0.98261191 | 0.94024605 | 0.94030239 | 0.94038327 | 0.94024605 |
Logistic Regression | 0.47975794 | 0.60456942 | 0.52590993 | 0.53839032 | 0.60456942 |

As mentioned above, many experiments were carried out to determine the research model. In each one, five algorithms were used to create candidate sub-models. Based on the sub-model confusion matrices, one algorithm was selected for each sub-model. In this way, the proposed model becomes stronger because it diagnoses breast cancer from different sets of parameters. The following table summarizes the selected algorithm and its results for each sub-model:

Sub-Models | Algorithm | AUC | CA | F1 | Precision | Recall |
---|---|---|---|---|---|---|
1 | Naive Bayes | 0.711853 | 0.723776 | 0.721288 | 0.719255 | 0.723776 |
2 | SVM | 0.82121394 | 0.74137931 | 0.74137931 | 0.74137931 | 0.74137931 |
3 | SVM | 0.99467523 | 0.97363796 | 0.97362524 | 0.97361893 | 0.97363796 |
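The selection step can be reproduced from the reported figures. The sketch below picks, for each sub-model, the algorithm with the highest classification accuracy (CA); using CA as the single selection criterion is an assumption made here, since the text says only that the choice was based on the confusion matrices. The variable names are illustrative.

```python
# CA values copied from the three per-sub-model results tables.
ca_results = {
    "Sub-model 1": {
        "SVM": 0.695804,
        "Random Forest": 0.674825,
        "Neural Network": 0.702797,
        "Naive Bayes": 0.723776,
        "Logistic Regression": 0.695804,
    },
    "Sub-model 2": {
        "SVM": 0.74137931,
        "Random Forest": 0.62068966,
        "Neural Network": 0.70689655,
        "Naive Bayes": 0.69827586,
        "Logistic Regression": 0.71551724,
    },
    "Sub-model 3": {
        "SVM": 0.97363796,
        "Random Forest": 0.9543058,
        "Neural Network": 0.97188049,
        "Naive Bayes": 0.94024605,
        "Logistic Regression": 0.60456942,
    },
}

# Assumed criterion: keep the algorithm with the highest CA per sub-model.
best_algorithm = {
    sub_model: max(scores, key=scores.get)
    for sub_model, scores in ca_results.items()
}

for sub_model, algorithm in best_algorithm.items():
    print(f"{sub_model}: {algorithm} (CA = {ca_results[sub_model][algorithm]:.4f})")
```

Under this criterion the selection agrees with the summary table: Naive Bayes for sub-model 1 and SVM for sub-models 2 and 3.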

This paper proposed a novel model for classifying breast cancer patients. The model was developed based on three heterogeneous datasets, which were used to build three sub-models. Each sub-model can diagnose the disease independently. The power of the model comes from checking patients from diverse perspectives, which reduces the risk of misdiagnosis. Most published models are based on only one dataset, which makes their diagnoses less accurate. As discussed in the related work, the existing models classify patients from a single diagnostic perspective; because the disease carries a high risk, the diagnosis should cover multiple dimensions to reach accurate results. The proposed model addresses this research gap. The model has been developed by conducting intensive experiments. Several classification algorithms were compared to select the best one for each sub-model. The obtained results have been evaluated and discussed in detail.

As future work, the model could be enhanced by adding more features related to lifestyle and social information. In addition, the sub-model idea could be extended with more dimensions, which should make the diagnoses more accurate.

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through the Small Group Research Project under grant number *(RGP.1/172/42)*