The classification of cancer subtypes is essential for the diagnosis and treatment of cancer. However, the gene expression data used for cancer subtype classification are high-dimensional and small in sample size. In this paper, an efficient dimensionality reduction method combined with an optimized long short-term memory (OLSTM) algorithm is used for gene expression data classification. The proposed method has three main stages: pre-processing, dimensionality reduction, and gene expression data classification. In pre-processing, missing values and redundant values are removed to obtain high-quality data. Next, dimensionality reduction is performed by orthogonal locality preserving projections (OLPP). Finally, gene classification is performed by the OLSTM classifier, in which the parameters of the traditional long short-term memory (LSTM) network are optimized using the adaptive artificial flora optimization (AAFO) algorithm. The AAFO algorithm is inspired by the migration and reproduction processes of flora. The performance of the proposed method is analyzed using accuracy, sensitivity, specificity, precision, recall, and F-measure. The test results demonstrate the effectiveness of the gene expression data classification, with an accuracy of 94.19%. The proposed method is implemented on the MATLAB platform.

Cancer is the second leading cause of death worldwide, with 9.6 million deaths each year. About 18.1 million new cancer cases emerge each year [

Gene expression data have revealed the results of several genomic profiling projects. Over the years, the measurement of transcript levels has steadily moved from microarray technologies to sequencing [

A good way to achieve this is through dimensionality reduction (DR), which aims to reduce the volume of data in a data set, more specifically the number of attributes, and thereby improves the performance of learning methods by discarding inconsistent data. DR is a vital step in the processing of high-dimensional data, since it is essential to select the most important features. Once the important features are chosen, data classification is used to partition genes into groups according to the similarity of their gene expression data. Classification models applied to gene expression data have distinguished between different cancer subtypes as well as between normal and cancer samples. Furthermore, clinical data have been combined with gene expression data to improve detection accuracy. Models that rely on both clinical and gene expression data improve the accuracy of predicting a disease outcome compared with detection based on either type of data alone.

The main objective of the proposed approach is to classify patient data as normal or cancer data. High-dimensional data increase complexity and reduce classification accuracy. To reduce the dimensionality, the OLPP algorithm is used, and for classification, the OLSTM classifier is used. To enhance the performance of the LSTM classifier, the weight values are optimally selected using the AAFO algorithm. The artificial flora optimization (AFO) algorithm is enhanced with an opposition-based learning (OBL) strategy, which increases the searching ability of the AFO algorithm. The main contributions of the proposed approach are listed below:

To reduce complexity and increase classification accuracy, the high-dimensional input data are reduced using the OLPP algorithm. OLPP alleviates the difficulties present in principal component analysis (PCA) and locality preserving projections (LPP).

We design an OLSTM classifier for the classification process. To enhance the performance of the LSTM classifier, the weight values are optimally selected using the AAFO algorithm.

The proposed AAFO algorithm is a combination of the AFO algorithm and the OBL strategy. The OBL strategy is used to increase the searching ability of AFO.

Many researchers have developed methods for gene expression data classification. A few of these works are reviewed here. He et al. [

Elbashir et al. [

A cancer detection method combining AdaBoost and a genetic algorithm (GA) was proposed by Lu et al. [

Xu et al. [

Pilar et al. [

Sun et al. [

In this paper, an optimized hybrid classifier-based gene expression data classification is proposed to detect cancer. The proposed approach consists of three main stages, namely pre-processing, dimension reduction, and classification. In pre-processing, missing values are filled and redundant data are removed. To reduce the time complexity and increase the classification accuracy, dimensionality reduction is applied to the pre-processed data. For dimension reduction, OLPP is used. Finally, the reduced dataset is given to OLSTM to classify the gene expression data into the classes breast invasive carcinoma (BRCA), kidney renal clear cell carcinoma (KIRC), colon adenocarcinoma (COAD), lung adenocarcinoma (LUAD), and prostate adenocarcinoma (PRAD). To enhance the LSTM classifier, optimal weights are selected by the adaptive artificial flora optimization (AAFO) algorithm. In

In general, real-world data are of poor quality and cannot be given directly as input to data mining techniques. Such data are often incomplete. As a result, hidden attributes that may be of interest to a domain expert, technically referred to as anomalies, can be hard to find. Therefore, pre-processing of the source data is required. In pre-processing, data cleaning is an important step for obtaining high-quality data. Data cleaning involves noise removal, missing value assessment, and background correction, after which the data are normalized. These steps should be handled with care to improve the effectiveness of the proposed method. The pre-processed data are then fed to the dimension reduction process.
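
The cleaning steps described above can be illustrated with a minimal sketch. The text does not state how missing values are handled or how the data are normalized, so the choices here are assumptions: missing values are encoded as NaN and filled with the per-gene mean, exact duplicate samples are treated as redundant, and each gene is min-max scaled to [0, 1]. The function name `preprocess` is hypothetical.

```python
import numpy as np

def preprocess(X):
    """Clean a samples-by-genes expression matrix (illustrative assumptions only)."""
    X = np.asarray(X, dtype=float)

    # Assumption: missing values are NaN and are filled with the mean of the
    # observed values of the same gene (column).
    col_mean = np.nanmean(X, axis=0)
    X = np.where(np.isnan(X), col_mean, X)

    # Assumption: redundant data means exactly duplicated samples (rows).
    # Keep the first occurrence of each row and preserve the original order.
    _, first_idx = np.unique(X, axis=0, return_index=True)
    X = X[np.sort(first_idx)]

    # Min-max normalization per gene; constant genes map to 0 so that no
    # division by zero occurs.
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span
```

For example, `preprocess([[1.0, 10.0], [float("nan"), 20.0], [1.0, 10.0], [3.0, 30.0]])` fills the missing entry, drops the repeated first sample, and returns a 3-by-2 matrix scaled to [0, 1].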

To reduce the large dimension of the features, orthogonal locality preserving projections (OLPP) are applied after pre-processing. A large number of irrelevant features lowers the accuracy of gene data classification. The dimension reduction technique is used to shrink the feature space without losing classification accuracy. OLPP is a linear technique that tries to preserve the local structure of the data in the transformed domain. With traditional LPP, the data are difficult to reconstruct because the basis functions are non-orthogonal. This defect can be overcome by the orthogonal locality preserving projection technique. It produces orthogonal basis functions, so it can have more locality-preserving power than LPP. OLPP is the orthogonal extension of LPP [

PCA is an approach that reduces the data dimension by performing a covariance analysis between variables. The PCA projection involves the following steps:

Let us consider the patient record is

The weight

Here, t is a constant. W_{ij} = 0 when the node

After choosing the weight matrix, compute the diagonal matrix

Next, evaluate the Laplacian matrix

We define the orthogonal basis vectors

To compute the orthogonal basis vectors,

Calculate

Calculate

Let

The transformation matrix is represented as T and the one-dimensional representation of X is Y.

This transformation matrix reduces the dimensionality of the dataset. Through the above procedure, the dimensionality of the feature vectors of the gene expression data is reduced. Next, OLSTM is proposed to classify the gene expression data.
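
The first steps of the OLPP procedure (PCA projection, weight matrix, diagonal matrix, and Laplacian matrix) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: it assumes the heat-kernel weight exp(-||x_i - x_j||^2 / t) of the standard LPP formulation and a k-nearest-neighbour graph, and the function names and the `n_neighbors` parameter are hypothetical. The iterative computation of the orthogonal basis vectors is not included.

```python
import numpy as np

def pca_project(X, n_components):
    """Project samples (rows of X) onto the leading principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are eigenvectors of the covariance matrix, ordered by variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def heat_kernel_graph(Z, n_neighbors, t):
    """Return the weight matrix W, diagonal matrix D and Laplacian L = D - W."""
    n = Z.shape[0]
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        # Nearest neighbours of sample i, skipping the sample itself.
        order = np.argsort(sq_dist[i])
        neighbours = [j for j in order if j != i][:n_neighbors]
        # Assumed heat-kernel weight; W stays 0 for unconnected nodes.
        W[i, neighbours] = np.exp(-sq_dist[i, neighbours] / t)
    # Connect i and j if either one is among the neighbours of the other.
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))  # D_ii is the sum of row i of W
    return W, D, D - W
```

The pairwise-distance computation above stores an n-by-n-by-d array, which is acceptable for a few hundred samples after PCA but would need a more memory-efficient form for larger data.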

LSTM is a specialized type of recurrent neural network (RNN). An RNN stores the output of a layer and feeds it back to the input. RNNs are widely used on sequence data, relying on short-term memory to process a sequence of inputs. The main limitation of a naive RNN is that it cannot retain long-term memory. LSTM was proposed to address this limitation. The LSTM has three gates, namely the forget gate, input gate, and output gate, denoted as

The forget gate decides which information is discarded and which is kept in memory. Its mathematical function is given in

where F_{t} denotes the forget gate, c_{F} is the forget gate control parameter, w_{F} is the forget gate weight, A_{t} is the input of the system, L_{t−1} is the output of the previous LSTM block,

The input gate function is given in

where,

I_{t} → input gate

A_{t} → present input of the LSTM block

W_{I} → input gate neurons weight

C_{I} → bias value of input gate

The candidate value of the tanh layer is calculated using

where V_{t} represents the candidate for the cell state at time step t.

The candidate value is used at the input gate to select the vector to store, and whether information is kept in or deleted from memory depends on the output from the

where the memory cell state is represented as M_{t} and * denotes element-wise multiplication. Finally, the output gate is computed using

where the output gate is denoted as O_{t}, W_{O} denotes the weight of the output neurons, and the bias value is denoted as C_{O}. The output function is calculated using

where * denotes element-wise multiplication of vectors and the memory cell state is represented as M_{t}. The total loss function of the LSTM system is given in

The desired output is represented as T_{t}, and N is the total number of data points used to calculate the mean square error loss. To enhance the performance of the LSTM classifier, the AAFO algorithm is used to select the optimal weights. A detailed explanation of the algorithm is given below.
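
Before turning to the optimizer, the gate computations described above can be summarized in a short sketch. It assumes the standard LSTM formulation, in which each gate acts on the concatenation of the previous block output L_{t-1} and the current input A_t. The symbol names follow the text, while the function names, the `params` layout, and the bias name for the candidate layer are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(A_t, L_prev, M_prev, params):
    """One forward step of a standard LSTM cell.

    A_t: current input, L_prev: previous block output, M_prev: previous
    memory cell state. params maps a gate name ("F", "I", "V", "O") to a
    (weight matrix, bias vector) pair acting on [L_prev, A_t].
    """
    z = np.concatenate([L_prev, A_t])
    W_F, c_F = params["F"]
    W_I, c_I = params["I"]
    W_V, c_V = params["V"]
    W_O, c_O = params["O"]
    F_t = sigmoid(W_F @ z + c_F)    # forget gate
    I_t = sigmoid(W_I @ z + c_I)    # input gate
    V_t = np.tanh(W_V @ z + c_V)    # candidate value from the tanh layer
    M_t = F_t * M_prev + I_t * V_t  # updated memory cell state
    O_t = sigmoid(W_O @ z + c_O)    # output gate
    L_t = O_t * np.tanh(M_t)        # block output
    return L_t, M_t

def mse_loss(targets, outputs):
    """Mean square error over N data points."""
    targets, outputs = np.asarray(targets), np.asarray(outputs)
    return np.mean((targets - outputs) ** 2)
```

In this layout, the four weight matrices W_F, W_I, W_V, and W_O are the quantities that an external optimizer would tune, with `mse_loss` serving as one possible fitness value.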

To improve LSTM classifier performance, the weight values (W_{F}, W_{I}, W_{V}, W_{O}) of the LSTM are optimally selected using the AAFO algorithm. The AFO algorithm is inspired by the migration and reproduction processes of flora. It can be used to tackle complex, nonlinear optimization problems. Although a plant cannot move, it can spread its seeds within a certain range and thereby find the most suitable environment for its offspring. This random process is easy to reproduce, and the spreading space is wide; hence, it is well suited to intelligent optimization. The artificial flora algorithm comprises four fundamental elements: the original plant, the offspring plant, the plant location, and the propagation distance. Original plants are plants that are ready to spread seeds. Offspring plants are the seeds of the original plants, which cannot yet spread seeds themselves. The plant location is the position of a plant. The propagation distance is how far a seed can spread. There are three main types of behavior, namely evolution behavior, spreading behavior, and select behavior. To enhance the performance of the AFO algorithm, the solutions are updated again after each update step using crossover and mutation. The steps involved in weight optimization are described below.

_{F}, W_{I}, W_{V}, W_{O}). The length of the plant is

where W_{i}(k) represents the i^{th} plant at the k^{th} iteration. The random weight value is chosen in [0,1].

where;

where u represents the maximum weight value and v represents the minimum weight value. The opposite solution of W_{i}(k) is given in

where,

where, M shows the total number of plants and

where,

A new original plant is created when no offspring plant survives,

where, t represents the maximum limit

where,

In ^{th} solution fitness value. Based on the probability values, the solutions are updated.
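
The opposition-based part of the procedure can be sketched as follows. This is a minimal illustration assuming the usual opposition-based learning rule, in which the opposite of a solution W in the range [v, u] is u + v - W, and the fitter half of the combined population is kept. The function name, the `fitness` callable (a stand-in for the classifier loss), and the seed are hypothetical. The spreading, selection, crossover, and mutation steps are not shown.

```python
import numpy as np

def opposition_init(fitness, n_plants, dim, v=0.0, u=1.0, seed=0):
    """Opposition-based initialization of a plant population.

    fitness: callable to be minimized; v and u are the minimum and maximum
    weight values.
    """
    rng = np.random.default_rng(seed)
    plants = rng.uniform(v, u, size=(n_plants, dim))  # random weights in [v, u]
    opposites = u + v - plants                        # assumed opposite solutions
    candidates = np.vstack([plants, opposites])
    scores = np.array([fitness(w) for w in candidates])
    best = np.argsort(scores)[:n_plants]              # keep the fittest half
    return candidates[best], scores[best]
```

With a flora size of 50 and a toy fitness such as `lambda w: np.sum((w - 0.8) ** 2)`, the call `opposition_init(fitness, 50, 4)` returns 50 candidate weight vectors in [0, 1], sorted from best to worst.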

Based on the above process, the proposed classifier is used to classify the gene expression data into BRCA, KIRC, COAD, LUAD, and PRAD. The efficiency and effectiveness of the method are analyzed in the results and discussion section.

The proposed gene expression data classification is implemented on the MATLAB platform. The method uses the Gene Expression Cancer RNA-Seq dataset as input, which is available at

Method | S. No | Description | Value |
---|---|---|---|
The proposed method (AFO) | 1 | Flora size | 50 |
The proposed method (AFO) | 2 | Maximum iteration | 100 |
The proposed method (AFO) | 3 | Maximum epochs | 15 |
The proposed method (AFO) | 4 | Learning coefficient | 0.2 |
The proposed method (AFO) | 5 | Initial position | [0,1] |
The proposed method (AFO) | 6 | Survival probability | 0.5 |

The main objective of the proposed approach is to classify cancer data using dimension reduction and a classification algorithm. For dimension reduction, the OLPP algorithm is used, which reduces the complexity of classification. To assess the performance of the proposed dimensionality reduction, PCA and LPP are also combined with the OLSTM classifier for comparison. The classification accuracy and the other metrics are shown below.

In

The performance of the proposed technique in terms of precision is given in

Methods | Accuracy (%) | Sensitivity (%) | Specificity (%) | Precision (%) | Recall (%) | F-measure (%) |
---|---|---|---|---|---|---|
OLPP+OLSTM | 94.19 | 92.99 | 96.99 | 94.54 | 95.79 | 95.3 |
PCA+OLSTM | 90.54 | 88.82 | 93.96 | 90.75 | 91.75 | 92.1 |
LPP+OLSTM | 89.37 | 86.85 | 92.65 | 89.64 | 89.64 | 89.3 |
OLSTM | 85.07 | 85.07 | 88.51 | 85.81 | 86.81 | 86.3 |
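
For reference, the metrics in the table can be computed from confusion-matrix counts as in the sketch below. It uses the standard binary definitions, under which recall is the same quantity as sensitivity. How the values are computed or averaged over the five classes is not stated in the text, so this is an illustration rather than the exact evaluation procedure, and the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics, returned as percentages."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate, also called recall
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # fraction of predicted positives that are correct
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    values = {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }
    return {name: 100.0 * value for name, value in values.items()}
```

With 40 true positives, 45 true negatives, 5 false positives, and 10 false negatives, for instance, the accuracy is 85% and the sensitivity is 80%.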

To demonstrate the effectiveness of the proposed approach, we compare it with previously published research. Here, the existing method considered for comparison is that of Elbashir et al. [

The efficiency of the proposed method is analyzed in terms of accuracy in

This work presented an efficient dimensionality reduction method with OLSTM for gene expression data classification. The input dataset was downloaded from the UCI machine learning repository. First, the downloaded data were pre-processed. Then, dimensionality reduction was performed by OLPP. Finally, gene expression data classification was performed by OLSTM, whose weights were optimized using the AAFO algorithm. The proposed gene expression data classification was implemented in MATLAB. Its performance was analyzed using accuracy, sensitivity, specificity, precision, recall, and F-measure. The OLPP+OLSTM method achieved a classification accuracy of 94.19%, the highest among the compared methods. These results indicate that the proposed method classifies gene expression data more accurately than the other methods considered. Future work will focus on improving classifier performance by designing an efficient feature selection algorithm.

The author, with a deep sense of gratitude, thanks the supervisor for his guidance and constant support during this research.