Lysine lipoylation is a conserved and protective post-translational modification (PTM) found in both prokaryotic and eukaryotic proteomes. It is involved in many biological processes and closely linked with several metabolic diseases. Computational methods play a key role in building an accurate classification model for identifying lipoylation sites at the protein level. Traditional experimental techniques are costly and time-consuming, so a predictor model is needed to identify lysine lipoylation sites. This study proposes a model that predicts lysine lipoylation sites using a classification method known as the Artificial Neural Network (ANN). The ANN algorithm handles the noise and class imbalance present in the lipoylation site samples. In ten-fold cross-validation, the predictor achieved an accuracy of 99.88% and an MCC of 0.9976. The predictor model is therefore a useful tool for lipoylation site prediction. As demonstrated during feature analysis, certain residues around lysine lipoylation sites play a vital part in prediction. The results reported through the evaluation of this model can provide an informative explanation of lipoylation and its molecular mechanisms.

Lipoylation is a biologically significant, conserved and protective lysine post-translational modification (PTM) present in both eukaryotic and prokaryotic proteomes [

On the other hand, lysine lipoylation sites play a significant role in protein interactions and metabolic pathways [

Many molecular lipoylation sites remain unidentified and cannot be adequately recognised. Some basic steps have been taken, and computational methods have been developed for lipoylation identification. However, the experiments conducted to address such questions are costly and time-consuming, and research cannot proceed without first discovering the lipoylation sites, which makes this one of the most critical problems in the field. Related lysine PTM sites include carbonylation, crotonylation, succinylation, glycosylation, hydroxylation, S-nitrosylation, sumoylation, phosphorylation, ubiquitination, methylation and prenylation.

The growing interest in lipoylation sites has highlighted several important issues. When predicting lysine lipoylation sites, feature analysis reveals that the residues around the lipoylated lysine play a very important role. For sample training, the classification algorithm used is the ANN [

This research discusses three different features and a five-step method. The detailed method and flowchart are shown below in

The standardised dataset was collected from several sources, primarily UniProt, using advanced searches. The model uses 500 negative samples and 359 positive samples; redundant (similar) samples were removed before the positive samples were finalised.

By applying Chou’s formulation method [

Each peptide in the collected sample has a length of 41 residues: the SPC residue sits at position 21, with 20 residues upstream and 20 residues downstream.

where

One of the biggest problems in computational biology is formulating a discrete vector model that captures all the information of a sequence pattern from sequence data; the key step is deriving a vector representation from that data. Highly rated data-science and deep-learning algorithms, such as the Covariance Discriminant (CD) algorithm [

The site vicinity vector (SVV) is a sub-sequence of the primary sequence of a peptide or protein containing a PTM site. Let us assume that

The site vicinity vector, a sub-sequence of the primary peptide sequence, is given as

where the minimum constant value (MCV) is fixed by experiment; in this research, k holds the value 20.
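The window extraction implied by the SVV definition can be sketched as follows. The terminal padding residue `X` is an assumption of this sketch, not something specified in the text.

```python
def site_vicinity_vector(sequence, site_index, k=20, pad="X"):
    """Extract a (2k+1)-residue window centred on the modification site.

    site_index is 0-based; windows that run past either terminus are
    padded with a dummy residue (an assumption of this sketch).
    """
    window = []
    for pos in range(site_index - k, site_index + k + 1):
        if 0 <= pos < len(sequence):
            window.append(sequence[pos])
        else:
            window.append(pad)
    return "".join(window)

# Toy example: a lysine (K) at 0-based index 2 of a short peptide
svv = site_vicinity_vector("MAKVLLT", 2, k=3)  # "XMAKVLL"
```

With k = 20, as used in this study, each extracted SVV has the expected length of 41.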

To define elements, dimensions and quantitative descriptors of the sample sequences, statistical moments are used. Moments are defined by mathematicians and statisticians through distribution functions and polynomial theorems [

Using the 2D transformed matrix Z′, the Hahn moments, which have the property of being very fast to compute, were calculated as

Finally, the raw moments of the probability distribution were calculated to retain information useful for our standardised sample dataset, as

All raw moments up to the 3rd degree satisfy q + r ≤ 3, giving the ten moments N_{00}, N_{01}, N_{02}, N_{03}, N_{10}, N_{11}, N_{12}, N_{20}, N_{21} and N_{30}.
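As a concrete illustration, the raw moments N_{qr} of a 2D matrix can be computed directly from their standard definition. This is a sketch that assumes the transformed matrix is supplied as a nested list; it is not the paper's exact implementation.

```python
def raw_moment(Z, q, r):
    """Raw moment N_qr = sum_x sum_y x**q * y**r * Z[x][y]."""
    return sum(
        (x ** q) * (y ** r) * val
        for x, row in enumerate(Z)
        for y, val in enumerate(row)
    )

def raw_moments_up_to(Z, degree=3):
    """All N_qr with q + r <= degree; ten moments for degree 3."""
    return {
        (q, r): raw_moment(Z, q, r)
        for q in range(degree + 1)
        for r in range(degree + 1 - q)
    }

# Toy 2x2 matrix: N_00 is the total mass, N_10 and N_01 weight by index
moments = raw_moments_up_to([[1, 2], [3, 4]])
```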

The calculated frequency of each amino acid is stored in vector form, so it is called the frequency vector. It is computed as shown below in

Each q_{i} represents the frequency of a distinct amino acid residue, taken in alphabetical order.
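The frequency vector can be sketched in a few lines; the alphabetical ordering of the 20 standard one-letter amino acid codes follows the description above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, alphabetical

def frequency_vector(sequence):
    """Length-20 vector of residue counts, in alphabetical order
    of one-letter code (q_i = count of the i-th residue)."""
    return [sequence.count(aa) for aa in AMINO_ACIDS]

fv = frequency_vector("AAKL")  # two A's, one K, one L
```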

A 20 × 20 matrix named HPRIM is formed to capture the positional information of residues. It is calculated with the formula given as

Each

The reverse matrix, HRPRIM, is calculated using the formula

This yields 400 coefficients, the same number as obtained from HPRIM.

The frequency vector carries no positional information, so the AAPIV was introduced to encode it; the AAPIV is computed as a vector of length 20.

The RAAPIV incorporates positional information from the reversed sequence to uncover hidden, deeper features.
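One common construction in related position-incidence feature work sums, for each of the 20 residues, the ordinal positions at which it occurs; RAAPIV applies the same sum to the reversed sequence. The exact formula used here is not spelled out, so the sketch below is an assumption.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical one-letter codes

def aapiv(sequence):
    """Length-20 vector: entry i sums the 1-based positions at which the
    i-th residue occurs (one common construction; an assumption here)."""
    vec = [0] * 20
    for pos, residue in enumerate(sequence, start=1):
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # silently skip non-standard residues
            vec[idx] += pos
    return vec

def raapiv(sequence):
    """Reverse AAPIV: the same positional sums on the reversed sequence."""
    return aapiv(sequence[::-1])
```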

The back-propagation method is used for error correction. The feature extractors applied to the sequence matrix comprise PRIM, RPRIM and the central raw moments,
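As an illustration of back-propagation-style error correction, the toy sketch below trains a single logistic neuron by gradient descent on the back-propagated error term. It is a stand-in on made-up data, since the paper's actual network architecture is not specified.

```python
import math
import random

def train_logistic_neuron(data, epochs=500, lr=0.5, seed=0):
    """Train one sigmoid unit by propagating the output error
    (y_hat - y) back to the weights; a toy stand-in for the
    paper's unspecified ANN architecture."""
    rng = random.Random(seed)
    n = len(data[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            y_hat = 1.0 / (1.0 + math.exp(-z))
            err = y_hat - y  # dLoss/dz for cross-entropy loss
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Toy linearly separable data: label 1 when x0 > x1
data = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.8], 0)]
w, b = train_logistic_neuron(data)
```

After training, the sign of the pre-activation w·x + b separates the two toy classes.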

When creating a new predictor model, it is important to estimate its success rate. To evaluate the model, two important questions should be asked: (1) Which metrics best measure the prediction model's quality? (2) Which test methods should be used to score those metrics?

Four types of metrics are most helpful for calculating the performance and accuracy of the model.

ACC is used to measure the overall performance of the model.

Sn is used to measure the sensitivity.

Sp is used to measure the specificity of the model.

MCC is used to measure the stability.
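The four metrics can be computed directly from the confusion-matrix counts; plugging in the self-consistency counts reported later (TP = 500, FP = 1, FN = 0, TN = 358) reproduces the stated values.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """ACC, Sn, Sp and MCC from a binary confusion matrix."""
    acc = (tp + tn) / (tp + fp + fn + tn)
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc

# The self-consistency confusion matrix reported in this study
acc, sn, sp, mcc = classification_metrics(tp=500, fp=1, fn=0, tn=358)
# acc ≈ 0.9988, sn = 1.0, sp ≈ 0.9972, mcc ≈ 0.9976
```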

Fortunately, a group of four equations was derived [

Whether a site is lipoylated or non-lipoylated,

The metrics of

The proposed computational model produces predicted and actual classifications, yielding the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The complete view of predicted versus actual values is displayed in the confusion matrix in

| Confusion matrix | Actual P | Actual N |
|---|---|---|
| Predicted P | 500 (TP) | 1 (FP) |
| Predicted N | 0 (FN) | 358 (TN) |

Another tool that illustrates the predictor model's results is the Receiver Operating Characteristic (ROC) curve. The ROC graph for the self-consistency test of this model is shown in

The overall performance of the system can be seen through self-consistency testing

| Feature | Accuracy (%) | Specificity (%) | Sensitivity (%) | Matthews correlation |
|---|---|---|---|---|
| Proposed model | 99.88 | 99.72 | 100 | 0.9976 |

Cross-validation checks whether the proposed system performs acceptably in the absence of a validation set. The dataset is divided into k folds. Applying the proposed computational model again yields the TP, FP, FN and TN counts, displayed in the confusion matrix in

| Confusion matrix | Actual P | Actual N |
|---|---|---|
| Predicted P | 500 (TP) | 1 (FP) |
| Predicted N | 0 (FN) | 358 (TN) |

The Receiver Operating Characteristic (ROC) graph for the 10 folds of our proposed predictor model is shown below in

| 10-Fold CV | Positive | Negative |
|---|---|---|
| F1 | 100 | 100 |
| F2 | 100 | 100 |
| F3 | 100 | 100 |
| F4 | 100 | 100 |
| F5 | 100 | 100 |
| F6 | 100 | 100 |
| F7 | 100 | 100 |
| F8 | 100 | 100 |
| F9 | 100 | 100 |
| F10 | 100 | 98.82 |
| Average | 100 | 99.88 |
| Overall | 99.8 | |

The prediction model uses datasets collected from previous experimental outcomes. Model accuracy can sometimes be tested against previous data, but datasets that are already experimentally proven are hard to obtain, and even an authentic dataset may be insufficient to produce a result. In that situation, a specific test is performed to check the model's credibility: when a validation set is missing, the cross-validation method is applied to study the model's performance. The accumulated results are shown below in

Testing is an essential step after training. Testing is carried out 'X' times, acting on each partition in turn, so accuracy can be measured at every iteration; cross-validation then reports the average precision over all results. The same procedure is applied to the positive and negative datasets. Initially, a value of 'X' is chosen to form the subsets. Cross-validation is beneficial compared with other methods because it yields unbiased estimates. The results of ten-fold cross-validation are given in
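The fold-partitioning procedure described above can be sketched as follows, with 'X' = k folds; this is a generic illustration, not the authors' exact implementation.

```python
def k_fold_splits(samples, k=10):
    """Partition samples into k folds and yield (train, test) pairs,
    with each fold serving as the test set exactly once."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# Example: every sample appears in exactly one test fold
data = list(range(25))
splits = list(k_fold_splits(data, k=5))
```

Accuracy is then computed on each test fold and averaged across the k iterations.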

| Feature | Accuracy (%) | Specificity (%) | Sensitivity (%) | Matthews correlation coefficient |
|---|---|---|---|---|
| Proposed model | 99.88 | 99.71 | 100 | 0.9975 |

For independent dataset testing, we divided our dataset into two parts: 70% for training and 30% for testing the proposed predictor, after which the independent dataset test was executed for lipoylation. The resulting TP, FP, FN and TN counts are displayed in the confusion matrices of

The Receiver Operating Characteristic (ROC) graph for independent dataset testing of our proposed predictor model is shown in

| Confusion matrix (70% training split) | Actual P | Actual N |
|---|---|---|
| Predicted P | 346 (TP) | 0 (FP) |
| Predicted N | 0 (FN) | 255 (TN) |

| Confusion matrix (30% testing split) | Actual P | Actual N |
|---|---|---|
| Predicted P | 154 (TP) | 0 (FP) |
| Predicted N | 0 (FN) | 104 (TN) |

The overall performance of the system can be seen through independent testing. The result of the testing is shown below in

| Feature | Accuracy (%) | Specificity (%) | Sensitivity (%) | Matthews correlation coefficient |
|---|---|---|---|---|
| Proposed model | 100 | 100 | 100 | 1 |

To assess the usefulness of our predictor model, we made a comparative analysis with existing predictors for lipoylation sites. From

| Different features | Accuracy (%) | Specificity (%) | Sensitivity (%) | MCC |
|---|---|---|---|---|
| AAC | 79.39 | 78.47 | 97.69 | 0.3746 |
| KNN features | 70.40 | 65.85 | 74.94 | 0.4096 |
| Secondary tendency structure | 73.68 | 77.40 | 69.96 | 0.4749 |
| Bi-Gram | 75.99 | 76.81 | 75.17 | 0.5199 |
| Tri-Gram | 77.78 | 78.27 | 77.28 | 0.5555 |
| AAF | 98.66 | 100 | 71.92 | 0.8421 |
| BE | 99.77 | 99.83 | 98.65 | 0.9752 |
| BPB | 99.94 | 99.93 | 100 | 0.9930 |
| FNT | 80.68 | 80.29 | 81.07 | 0.6136 |
| Proposed model | 99.88 | 99.72 | 100 | 0.9976 |

As shown in

As described in most of the latest research papers [

This research set out to introduce a unique and accurate predictor model for lipoylation sites. The model's performance was checked by applying self-consistency testing and ten-fold cross-validation using accuracy metrics. The self-consistency results for MCC, Sp, ACC and Sn are 0.9976, 99.72%, 99.88% and 100%, respectively, as shown in