Supportive learning plays a substantial role in providing a quality education system. The evaluation of students’ performance deepens their insight into the subject knowledge. Specifically, it is essential to maintain the baseline foundation on which a broader understanding of their careers is built. This research concentrates on establishing the relationships in students’ knowledge even with reduced samples. Here, the Synthetic Minority Oversampling TEchnique (SMOTE) is used for pre-processing the imbalanced input dataset to enhance the prediction accuracy; when this initial processing is not done properly, it leads to misleading prediction accuracy. This research concentrates on modelling an efficient classifier to predict students’ performance. Generally, the online available student datasets comprise a small number of samples, and k-fold cross-validation is performed to make the best use of them. Then, the relationships among the students’ performance features are measured using an auto-encoder. The stacked Long Short Term Memory (LSTM) model is then used for classification.

Numerous colleges and universities suffer from the poor performance of students in today’s world, even though the latest standard for higher education has been raised through outcome-based education [

A vast number of attributes, ranging from non-academic to educational features, generally influence students’ academic achievements [

Accessible algorithms and techniques enhance the accuracy of predicting students’ performance [

However, existing ensemble machine learning solutions do not dynamically weight the contribution of the participating techniques when predicting students’ performance. Furthermore, there are restrictions concerning the misuse of the training dataset or the use of a single dataset to validate the model [

Here, a publicly accessible dataset of students is taken and considered as input to the predictor model.

The imbalanced dataset (with minority and majority samples) is balanced using the pre-processing step known as SMOTE. Later, the features are analyzed with word embeddings and auto-encoders, where the learnt features play a substantial role in enhancing the prediction accuracy.

Finally, the classification process is done with a stacked LSTM model, where the network learns the initial state to measure the performance. The simulation is done in the MATLAB 2020a environment, and various metrics like precision, accuracy, recall and F1-score are evaluated and compared with multiple prevailing approaches.
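As a brief illustration (the paper computes these metrics in MATLAB; the sketch below is an independent Python rendering, not the authors’ code, and `prf_metrics` is a hypothetical helper name), the four reported metrics follow directly from confusion-matrix counts:

```python
def prf_metrics(tp, fp, fn, tn):
    """Precision, recall, F1-score and accuracy from confusion-matrix
    counts (true/false positives and true/false negatives)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```

For multi-class evaluation these counts are typically taken per class and averaged.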

An exhaustive performance evaluation is performed with various metrics against prevailing prediction approaches for early student performance on seven standard benchmark datasets. The experimental outcomes establish the efficiency of our predictive models and their advantages.

We discuss the associated work from two essential viewpoints, since our study concentrates on computer-based predictive analysis. The first introduces the fundamental ideas that describe academic performance; the second explores the modern techniques that help predict and explain it.

Commonly, data analytics on academic information involves two crucial aspects: learning analytics and predictive analytics [

Prediction in higher education is a worthy task for achieving strategic advantages such as the development of early-warning and course-path recommendation methods, the identification of unfortunate student behaviour, and the automation of education programme assessments [

For example, a few studies have gone beyond predicting course grades to identify endangered students. Moreover, exact predictive modelling in academics is still tricky because of data sparsity and exponentiality issues, even for influential classifiers like SVM. To illustrate, consider the latter challenge for the SVM [

The researchers proposed a genetic-programming technique to find underperforming students, especially those challenged by socio-economic demerits. In this technique, students’ data are gathered from different origins, which strengthens the feedback commended to the decision-makers. However, the suggested architecture does not identify the attributes that lead to the predicted performance. The author in [

Owing to the previously inadequate use of hybrid techniques, this technique integrates the benefits of unsupervised and supervised learning to optimize and automate student academic prediction accuracy.

The existing techniques are inflexible in examining the myriad academic and non-academic factors that need to be considered for their impact on the quality of student education. A few methods predict students’ accomplishments without relating them to the enabling attributes or feasible demerits, while others consider only small subsets of efficient features.

The hybrid techniques comprise models that do not adapt their contributions to the predictions dynamically under the students’ environments.

Most prediction techniques are validated using a single dataset, which is a threat to the approach’s viability [

| Methods | Focus | Computation | Dataset | Observations |
|---|---|---|---|---|
| Matrix factorization and linear regression model | Predicts students’ results in their courses under the chosen degree pathways | RMSE = [0.63, 0.72] | University of Minnesota, USA; private dataset (2k undergraduate students, 2k courses, 75k student course grades, 2 majors) | The research concentrated on the prediction of grade letters. |
| Attention graph convolutional network model | Predicts students’ consecutive-semester course grades and detects students at risk of dropping out or failing | MAE = [0.30, 0.54] | George Mason University, USA; private dataset (43,490 undergraduate students, 185 courses, 385,505 grades, 5 majors) | Prediction on rank. |
| Five models compared: KNN (k-Nearest Neighbour), Linear Discriminant Analysis (LDA), ANN, SVM, … | Predicts students’ performance at the postgraduate level from small datasets | Precision = [58.1%, 69.7%] | British University in Dubai, UAE; private dataset (50 postgraduate students, 9 courses, 311 instances, 1 major) | Shows that smaller datasets are suitable for training at the postgraduate level. |
| Bayesian deep learning approaches (LSTM and Multi-Layer Perceptrons (MLP)) | Predicts student grades, evaluates the uncertainty related to performance, and identifies the basic courses for student success | MAE = [0.253, 0.588] | George Mason University, USA; private dataset (28,717 undergraduate students, 182 courses, 249,716 grades, 5 majors) | The model observes the preceding semesters’ courses and then the students’ knowledge acquisition. |
| Matrix factorization | Predicts the next term’s grade from latent features such as course tutors and academic level | MAE = [0.615, 0.654] | George Mason University, USA; private dataset (11,027 undergraduate students, 1,318 courses, 140,259 grades, 8 majors) | Integrates extra latent features with matrix factorization. |
| Generative and discriminative classification models: C4.5, SVM, NB, CART, and Bayes network | Predicts the completion of students’ degrees using personalized attributes such as family expenses | Precision = [71%, 86.7%] | Different universities, Pakistan; private dataset comprising 776 student records | Analyzed 23 attributes, such as family expenses; students’ success can be predicted from their personal information. |

This section provides a detailed analysis of the anticipated stacked LSTM model for supportive learning. Here, an online dataset known as the Students’ Performance in Exams dataset, available on Kaggle, is considered for validation [

Examining the sample distribution shows that the samples are imbalanced across the classes; in the worst-case scenario, the number of samples in the majority classes is ten times that of the minority classes. Moreover, some samples lie near the classification boundaries. These factors increase the complexity of the classification task and degrade model performance, so data augmentation becomes vital. Here, SMOTE is used as a pre-processing approach that generates synthetic data for the minority classes. However, plain SMOTE does not consider the significant factors related to adjacent majority classes while synthesizing the minority-class data, so the classes can overlap. To resolve this issue, greater attention is given to the borderline data samples by evaluating the points near the available minority class. The minority class is labelled first, and the nearest neighbours are extracted from the available minority-class samples. The set of chosen minority samples is then related to the majority class. For each chosen neighbour, the difference between the samples is multiplied by a random factor ranging from 0 to 1, and these values are added to the available data samples. Thus, the synthetic samples are generated based on

Here,
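The interpolation step described above can be sketched as follows (an illustrative re-implementation, not the authors’ code; `smote_sample` and its parameters are hypothetical names). Each synthetic sample is placed a random fraction of the way from a minority sample towards one of its k nearest minority neighbours:

```python
import numpy as np

def smote_sample(minority, k=5, n_new=100, seed=0):
    """Generate synthetic minority-class samples by interpolating between
    a chosen sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # distances from x to every minority sample (itself included)
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip x itself at distance 0
        x_nn = minority[rng.choice(neighbours)]
        gap = rng.random()                   # random factor in [0, 1)
        synthetic.append(x + gap * (x_nn - x))
    return np.array(synthetic)
```

The borderline variant discussed in the text additionally restricts the chosen samples to those whose neighbourhoods contain majority-class points.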

The auto-encoding part of the NN is split into two parts, an encoder and a decoder. It is mathematically provided as in

The network’s encoding part is specified as a function parameterized by the bias b and activation function σ, mapping to the latent dimension z. It is shown in

The NN’s decoding part is provided in a related way, represented with its own activation function, weight, and bias. It is expressed as in

The loss function

The objective of an auto-encoder is to choose suitable encoding and decoding functions such that minimal information needs to be encoded and the input can be regenerated by the decoder with a minimal loss function. Based on the provided
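The encoder/decoder pair just described can be sketched numerically as follows (an illustration with hypothetical layer sizes and randomly initialized, untrained weights; the paper’s actual model is trained to minimize this loss):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical sizes: 8 input features compressed to a 3-dim latent space z.
n_in, n_z = 8, 3
W_enc, b_enc = rng.normal(0, 0.1, (n_z, n_in)), np.zeros(n_z)
W_dec, b_dec = rng.normal(0, 0.1, (n_in, n_z)), np.zeros(n_in)

def encode(x):
    # z = sigma(W x + b): latent representation
    return sigmoid(W_enc @ x + b_enc)

def decode(z):
    # x' = sigma'(W' z + b'): reconstruction with its own weights/bias
    return sigmoid(W_dec @ z + b_dec)

def loss(x):
    # reconstruction loss ||x - x'||^2, minimized during training
    return float(np.sum((x - decode(encode(x))) ** 2))
```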

Initially, capture the descriptive meta-data and characteristics as features and construct the feature vectors as

Apply the traditional k-means for feature vector clustering and predict the cluster group.

Consider the class groups and corresponding identifications (tags) as labels;

Feed the input data with the corresponding feature vectors and generated labels to the successive stages.

Then, construct the auto-encoder model based on NN with specific hidden neurons and layers, i.e., nodes.

The number of nodes over the inner layers specifies the number of clusters;

The number of nodes over the input layer specifies the feature size and vectors;

The nodes over the output layer specify the probability values for the two provided datasets, representing the cluster labels.

Then, partition the constructed data into testing/training datasets.

Train the auto-encoder-based stacked LSTM with the training dataset.

Predict and cluster the testing dataset labels with the trained network model.
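The k-means clustering step in the procedure above can be sketched as follows (a plain illustration with a deterministic spread initialization; `kmeans_labels` is a hypothetical helper, not the authors’ implementation):

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50):
    """Plain k-means: returns a cluster label for each feature vector,
    used as the pseudo-labels fed to the auto-encoder/LSTM stages."""
    # deterministic initialization: k samples spread across the dataset
    idx = np.linspace(0, len(X) - 1, k).astype(int)
    centroids = X[idx].astype(float)
    for _ in range(n_iter):
        # assign each sample to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute centroids; keep the old centroid if a cluster empties
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels
```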

The encoding part is accountable for extracting the most influential or essential features of the input data. The encoder and decoder reduce the feature space, and the chosen features are used for clustering: the encoder distils the full feature set down to the most critical components of the input data, and the decoder takes this diminished set of influencing features and intends to reconstruct the initial values without losing information. Together, the encoder and decoder form the mechanism for reducing the data dimensionality before clustering. The objective of knowledge tracing relies on the students’ past state. This state is not strictly connected with the time series; rather, it varies with learning ability. Since students generally learn gradually, the consequences of the time series must also be considered while tracing students’ knowledge state. The learning level is updated constantly as a student grasps a related knowledge concept within a specific time and then forgets it. Thus, the stacked LSTM model analyses the student’s sequence.

The concept of determining multiple features with SMOTE measures the learning performance. Initially, to eliminate the unit restriction of every feature and transform it into a dimensionless numerical value, every sequential feature

Here,
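One common way to remove a feature’s unit restriction is min-max normalization; the following is an illustrative sketch under that assumption, not necessarily the exact transform given by the paper’s equation:

```python
import numpy as np

def minmax_normalize(x):
    """Scale a sequential feature to [0, 1], removing its unit restriction."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)
```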

The covariance matrix D is expressed as in

The eigenvectors and their corresponding eigenvalues are evaluated via D. The eigenvalues are sorted from the largest to the smallest, the corresponding eigenvectors are sorted accordingly, and this method chooses the initial

Here,

Here,

Here,
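The covariance eigen-decomposition step described above can be sketched as a principal-component reduction (an illustrative rendering; `pca_reduce` and the component count `k` are hypothetical names):

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce features by eigen-decomposition of the covariance matrix D,
    keeping the k eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)          # centre each feature
    D = np.cov(Xc, rowvar=False)     # covariance matrix
    vals, vecs = np.linalg.eigh(D)   # eigh: for symmetric matrices
    order = np.argsort(vals)[::-1]   # sort eigenvalues largest-to-smallest
    top = vecs[:, order[:k]]
    return Xc @ top                  # project onto the top-k components
```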

At

The stacked LSTM is efficiently used for resolving the gradient explosion problem with its set of memory units, as in
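A single LSTM memory unit’s forward step can be sketched as follows (a numerical illustration with hypothetical weight shapes, not the paper’s trained model; stacking means feeding one layer’s hidden state h as the next layer’s input x):

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM memory-unit step: gates are computed from input x and
    the previous hidden state h; c is the cell (memory) state."""
    n = h.shape[0]
    z = W @ x + U @ h + b              # all four gate pre-activations
    i = 1 / (1 + np.exp(-z[:n]))       # input gate
    f = 1 / (1 + np.exp(-z[n:2*n]))    # forget gate
    o = 1 / (1 + np.exp(-z[2*n:3*n]))  # output gate
    g = np.tanh(z[3*n:])               # candidate cell update
    c_new = f * c + i * g              # memory update
    h_new = o * np.tanh(c_new)         # new hidden state
    return h_new, c_new
```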

The fault or error over the incoming data flow is assumed to follow a Gaussian distribution. Under this assumption, the storage is highly efficient and provides robust outcomes. Here, the Gaussian probability distribution is used to identify the attributes in the presence of a specific class label. It is expressed as

Here,
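The class-conditional Gaussian density in question has the standard form below (a direct rendering of the normal pdf; the per-class mean and standard deviation are assumed to be estimated from the training data):

```python
import numpy as np

def gaussian_likelihood(x, mu, sigma):
    """Gaussian probability density of attribute value x given the
    class-conditional mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * np.sqrt(2 * np.pi))
    return coeff * np.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
```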

Based on the above methodology, the anticipated model is composed of two diverse sub-tasks, known as feature learning and classification. Here, 70% of the data is considered for training and 30% for testing. The anticipated model adopts gradient descent for training the stacked LSTM with a 0.01 learning rate and 100 epochs (mini-batches). The loss function and accuracy are monitored: the training loss is reduced and the accuracy increases over the epochs. After the initial epochs, the proposed stacked LSTM model initiates the training data optimization process. The model is trained from the beginning and evaluated during the testing process to avoid over-fitting issues. Here, cross-entropy is considered as the loss function, and it is expressed as in
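The cross-entropy loss just referenced can be sketched as follows (an illustrative implementation; the `eps` clipping is an assumption added only for numerical safety):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Multi-class cross-entropy: y_true holds one-hot labels, y_pred the
    predicted class probabilities (each row sums to 1)."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))
```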

While validating the multi-classification model, the proposed stacked LSTM model needs to produce a probability for every class, where the target class possesses the highest probability. Here, y and y’ specify the expected and predicted possibility for the given label 1. The softmax function is utilized as the activation function over the stacked LSTM layers. The significant reasons for using the softmax function are that it produces probabilities in the range 0 to 1 and that the probabilities sum to 1: whereas the model’s raw outputs lie in diverse ranges, the softmax aligns the results to the range from 0 to 1 while predicting the target class. The softmax function is provided at the output layer, and it is expressed as in

Here,
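A numerically stable softmax matching this description can be sketched as:

```python
import numpy as np

def softmax(z):
    """Softmax: outputs lie in (0, 1) and sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()
```

Subtracting the maximum logit does not change the result but prevents overflow for large inputs.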

The outcomes of the various word embedding models are depicted in

| Existing approaches | Prediction accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| LSTM + embedding process | 77% | 90% | 62% | 73% |
| LSTM + attention process | 76% | 82% | 85% | 83% |
| Fused LSTM | 75% | 89% | 83% | 85% |
| Aspect attention + GRU | 84% | 93% | 90% | 91% |
| Layered LSTM | 85% | 87% | 88% | 87% |
| Proposed stacked LSTM + auto-encoder | 89% | 83% | 86% | 87% |

| Existing approaches | Prediction accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| Sentiment analyzer | -- | 69% | 76% | 72% |
| Supervised SVM | 78% | -- | -- | -- |
| Naïve Bayes | 89% | -- | -- | -- |
| Naïve Bayes + lexicon | 80% | -- | -- | -- |
| Bi-directional SVM | 80% | -- | -- | -- |
| Proposed stacked LSTM + auto-encoder | 89% | 83% | 86% | 87% |

The model outperforms the baseline classifiers in the feature learning and classification tasks, attaining 89% accuracy in the detection task and 90% accuracy in feature extraction. Some investigations evaluate the model over standard datasets, including restaurant and laptop reviews. To validate the performance of the proposed model, it is applied over various domains with slight variations to the input and output parameters.

The evaluation of students’ performance is done from various aspects, and some manual computation is also done in the real-time evaluation process. A novel stacked LSTM model is used for the automatic feature learning and classification process with the available data. The prediction framework applies SMOTE for pre-processing and the stacked LSTM for the classification process. The features are learnt using the auto-encoder concept to measure the influencing features over the supportive learning process. The simulation is done in the MATLAB 2020a environment, and various metrics like accuracy, precision, F1-score and recall are evaluated. The proposed model gives 89% accuracy, 83% precision, 86% recall, and 87% F1-score, offering satisfactory outcomes compared to various existing approaches. However, the model encounters constraints like data acquisition with constant labels, as the classification criteria may change among the standard datasets. In the future, the construction of a real-time dataset is highly solicited to boost the performance of the proposed classifier model.

The authors received no specific funding for this study.

The authors declare that they have no conflicts of interest to report regarding the present study.