Sentiment analysis has attracted the attention of Egyptian decision-makers in the education sector. It offers a viable method to assess the quality of education services based on students’ feedback and provides an understanding of their needs. As machine learning techniques offer automated strategies to process big data derived from social media and other digital channels, this research uses a dataset of tweet sentiments to assess several machine learning techniques. After the dataset is preprocessed to remove symbols, stemming and lemmatization are performed for feature extraction. Several machine learning techniques and a proposed Long Short-Term Memory (LSTM) classifier optimized by the Salp Swarm Algorithm (SSA) are then applied, and their performance is measured. The validity and accuracy of commonly used classifiers, namely Support Vector Machine (SVM), Logistic Regression (LR), and Naive Bayes (NB), were reviewed and compared with the SSA-optimized LSTM classification model. Finally, as the SSA-optimized LSTM achieved the highest accuracy, it was applied to predict the sentiments of students’ feedback and evaluate their association with the course outcome evaluations for education quality purposes.

Sentiment analysis is a natural language processing technique used to assess whether a piece of information is negative, positive, or neutral. It is frequently performed on text-based information to help organizations monitor brands, assess their products and services based on customers’ feedback, and understand customers’ needs [

The Egyptian government directs many resources towards public services to improve the quality of life in Egypt. From that perspective, it is important to infer people's opinions about the different services and facilities so that its economic sectors can be continuously improved. An automated strategy to understand such sentiments is necessary to support better planning of future services. This research therefore applies several machine learning methods to a standard tweets dataset and compares their relative accuracy measures [

Many researchers have used sentiment analysis in different aspects of life. Research by Hermanto et al. [

Another study surveyed sentiment analysis techniques and described a pipeline composed of four fundamental phases. The first is the dataset collection phase, in which the Twitter API is used to extract tweets with positive and negative sentiments. The second is the preprocessing of tweets, in which slang words and incorrect spellings are removed before the features are extracted; the slang word dictionary is built using domain information. The third is the formation of a feature vector, in which explicit features such as hashtags and emotions are extracted. Feelings are assigned specific weights based on the polarity they represent, with “1” being the weight for positive feelings and “−1” the weight for negative feelings. The fourth is the sentiment analysis itself, in which the feature vector is classified using standard classifiers followed by an ensemble classifier with better accuracy [

Recently, sentiment analysis has been heavily applied in the education domain, specifically to students’ feedback and responses to course quality surveys. This is a complicated and daunting task due to students' dialect and expressions and the amount of data that must be processed. Therefore, sentiment analysis remains challenging despite the growing body of research. Many researchers have approached sentiment analysis from various perspectives. Nevertheless, comprehensive literature reviews that analyze, sort, and classify the results of the different algorithms used for sentiment analysis in the education domain are limited. Techniques from deep learning (DL), big data (BD), machine learning (ML), and natural language processing (NLP) represent the main directions for achieving better accuracy. In research developed by Kastrati et al. [

With the increase in user-generated content on social media, particularly Twitter, the need for tweet sentiment analysis, that is, the analysis of the emotional status and mood of users tweeting about a certain topic, has become apparent. Therefore, many researchers have tackled tweet sentiment analysis, aiming to improve the practical performance of the approaches used and to apply their algorithms in recommendation systems and decision support systems. While many approaches focused on enhancing performance using the feature ensemble method, they considered neither the sentiment context of the words nor the fuzzy sentiment; rather, they frequently focused on semantic meaning. The fuzzy sentiment approach proposed by Phan et al., based on a feature ensemble, considered parameters such as word sentiment polarity, lexical and linguistic elements, and the type and position of the words. They implemented the approach on real data and reported improved sentiment analysis performance [

As online shopping has become a trend, especially after COVID-19, when the sales of giants like Amazon soared, sentiment analysis of product reviews on eCommerce websites can substantially increase the quality of service and thus user satisfaction. Yang et al. proposed a model that used data from the Chinese eCommerce platform dangdang.com after crawling and cleaning it. The research combined convolutional neural networks (CNN) with a bidirectional gated recurrent unit (GRU). The model, named SLCABG, used deep learning and a sentiment lexicon to improve on existing sentiment analysis methods for product reviews. The algorithm first used the sentiment lexicon to enhance the sentiment features in the product reviews. The CNN and GRU then extracted sentiment and context features. Finally, the weighted sentiment features were classified using an attention mechanism, which showed an enhanced sentiment analysis performance [

A few research directions have utilized evolutionary optimization techniques to optimize long short-term memory (LSTM) hyperparameters and other deep learning architectures [

Recently, in the education sector, particularly with COVID-19, many organizations were forced to switch from conventional to online education. With this change happening quickly in many developing countries lacking infrastructure and technologies, many academics and students alike were resistant to it. It became more important than ever to measure students’ feedback and emotions. Until very recently, many researchers identified students’ emotions using conventional methods. However, deep learning models, especially LSTM with attention layers, have gained momentum for analyzing students' emotions. Recent research by Sangeetha and Prabha processed sequences of phrases in parallel across attention layers using GloVe and CoVe embeddings. The outputs of the multiple layers were fused and fed to the LSTM layer. They experimented with various dropout rates to improve the accuracy. The research concludes that LSTM with a fused multilayer outperforms common methods [

Research by Kastrati et al. emphasized that, to gain valuable insights about the learning process, organizations must analyze the feedback collected from students. For courses with few enrolled students, this can be simple enough to be handled manually. Nevertheless, analyzing such feedback manually becomes impractical for courses with a large number of enrolled students, for instance, online courses delivered through massive open online course (MOOC) platforms [

Therefore, this paper proposes a framework to analyze the feedback from students. The methodology targets aspect-level sentiment analysis. It uses opinion polarity regarding a particular aspect in the unlabeled students’ feedback and propagates the signals to classify the aspect category, thus dramatically reducing the need for labeled data, which is the major bottleneck of deep learning. This can be achieved by utilizing a pre-trained predictive system with high classification accuracy, trained on a labeled dataset similar to the type of writing students might use in their feedback comments.

The layout of this research contains the following sections. Section three introduces the materials and methods employed in this research. Section four presents the prediction results, discusses the classification accuracy, and shows the application of the classifier with the highest accuracy to the students’ feedback analysis. The paper concludes in Section five.

The first stage of this research is the selection of a tweet sentiment dataset with three classes. The second stage is the preprocessing of the Twitter dataset to remove symbols and perform stemming and lemmatization, followed by normalizing the features extracted from the tweets. Several classifiers can then be applied to predict each tweet’s sentiment as positive, neutral, or negative, using machine learning models such as Naive Bayes, Support Vector Machines, and Logistic Regression. This research uses those techniques as representatives of commonly used classifiers against which the proposed automated classification method is evaluated, as shown in the process in

The Twitter dataset contains 163k tweets along with their sentiment labels. All the comments in the dataset are cleaned and assigned a sentiment label using TextBlob. The dataset can be used to build a sentiment analysis machine learning model. It was collected from tweets posted on Twitter, using the Twitter API to extract them. The dataset has been used to investigate different parts of sentiment analysis classification. Other technologies such as Amazon EC2, Google Visualization, Google Charts, Google Sites, Google Spreadsheets, Google Closure, and Google Analytics were utilized. In this approach, any tweet with positive feelings, like “:)”, was considered positive, and tweets with negative feelings, like “:(”, were considered negative. From each tweet record, information such as the tweet id, text, client name, and so on can be extracted [

The dataset consists of 162,969 tweets with negative, neutral, and positive sentiment: 72,249 positive tweets, 35,509 negative tweets, and 55,211 neutral tweets. The standard dataset is therefore treated as balanced, although the positive class is roughly twice the size of the negative class. The following

The text needs to be cleaned by dividing it into words and handling case and punctuation. An entire set of text preprocessing strategies may have to be used, and the selection of the right method depends on the natural language processing task. The initial phase in cleaning up text is to have a solid idea of what we are attempting to accomplish and, in that setting, to review the text to see what exactly may help. Filtering out markup, new lines, punctuation, hyphenated descriptions, dashes, names, and markers, typically with regular expressions, is considered the first step in processing the tweets’ text. The outcome of text cleaning is a list of words that can be used in the machine learning models.
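As a rough illustration of this cleaning step, the following Python sketch lowercases a tweet, strips URLs, markup, mentions, hashtags, and punctuation, and returns a list of words. The function name `clean_tweet` and the specific patterns are assumptions made for illustration; in particular, dropping hashtags entirely is only one possible choice and is not confirmed by this text.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet, strip noise, and split it into a list of words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML-like markup
    text = re.sub(r"[@#]\w+", " ", text)        # drop @mentions and #hashtags (assumed choice)
    text = re.sub(r"[^a-z\s]", " ", text)       # drop digits, punctuation, and dashes
    return text.split()

print(clean_tweet("Loved the <b>new</b> course!! https://t.co/abc @uni #great"))
```

The order of the substitutions matters: URLs and markup are removed before punctuation so that their fragments do not survive as stray words.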

The approach for preparing the classification algorithms input is word embedding. It includes Word2vec utilizing models such as skip-gram and continuous bag-of-words (CBOW). Skip-gram tries to predict the words surrounding a given target word, usually in the center of the context. Continuous bag-of-words does exactly the reverse of that. It predicts a word that is likely to occur in a particular context.
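The difference between the two Word2vec training objectives can be seen in how training examples are formed from a sentence. The sketch below only builds the (input, output) pairs and does not train any embedding; the function names and the `window` parameter are illustrative assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """(target, context) pairs: the target word is used to predict each neighbour."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((target, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

def cbow_pairs(tokens, window=2):
    """(context, target) pairs: the surrounding words jointly predict the target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, target))
    return pairs

tokens = ["the", "course", "was", "great"]
print(skipgram_pairs(tokens, window=1))
print(cbow_pairs(tokens, window=1))
```

Skip-gram yields one example per (target, neighbour) pair, whereas CBOW yields one example per position with the whole context as input.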

Stop words are words that are filtered out because they do not contribute to the deeper meaning of the sentence. Usually, they are the most common words in the language, such as “the”, “a”, and “is”. They do not add sentiment information to the tweets, so for sentiment analysis it may make sense to remove them. That step can be achieved by comparing each word against a stop word list and filtering out the matches [
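A minimal sketch of this filtering step is shown below. The stop word set is a small illustrative subset chosen for this example, not the list used in this research; negation words such as "not" are deliberately left out of it because they can carry sentiment.

```python
# Illustrative subset only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "of", "to", "and"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stop_words(["The", "lecture", "is", "a", "disaster"]))
```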

Stemming is the process of reducing each word to its root or base. For instance, “fishing”, “fished”, and “fisher” can all be reduced to the stem “fish”. Sentiment analysis may benefit from stemming because it decreases the vocabulary and concentrates on the sentiment of a tweet instead of its deeper meaning. There are many stemming techniques, although the most common and long-standing one is the Porter stemming algorithm. Lemmatization, in contrast, refers to doing things properly with the use of a vocabulary and morphological analysis of words. It normally aims to remove inflectional endings only and to return the dictionary form of a word, known as the “lemma”. The two also differ in that stemming most commonly collapses derivationally related words, while lemmatization commonly collapses only the different inflectional forms of a lemma [
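The contrast between the two operations can be sketched with a toy example. The suffix stripper below is a deliberately naive stand-in and is not the Porter algorithm, and the lemma dictionary is a hypothetical three-entry lookup; both are assumptions made purely to show that stemming cuts strings while lemmatization consults a vocabulary.

```python
def naive_stem(word, suffixes=("ing", "ed", "er", "s")):
    """Strip the first matching suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma lookup; a real lemmatizer relies on a full vocabulary.
LEMMAS = {"was": "be", "better": "good", "children": "child"}

def lemmatize(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMAS.get(word.lower(), word.lower())

print([naive_stem(w) for w in ["fishing", "fished", "fisher"]])
print(naive_stem("better"), lemmatize("better"))
```

The last line shows the typical failure mode of pure suffix stripping: "better" is cut to a non-word, whereas the vocabulary lookup maps it to its lemma.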

Tokenization is the process of splitting text into tokens, which are then converted to vectors. Working with tokens also makes it easier to filter out unnecessary ones. Padding is used in sentiment analysis to give every input sample a consistent size. Frequently, zero-padding is used to fill the missing positions with zeros, so that sentences are padded to a fixed length for text classification using Bi-LSTM, as illustrated by Ali et al. [
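A small sketch of token encoding followed by zero-padding is given below. The helper names, the reservation of id 0 for padding, and the choice to pad at the end of the sequence are assumptions for illustration; some libraries pad at the front by default.

```python
def build_vocab(sentences):
    """Map each distinct token to a positive integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in sentences:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def encode_and_pad(tokens, vocab, max_len, pad_id=0):
    """Convert tokens to ids, truncate to max_len, and right-pad with zeros."""
    ids = [vocab[t] for t in tokens if t in vocab][:max_len]
    return ids + [pad_id] * (max_len - len(ids))

sentences = [["great", "course"], ["bad", "course", "really", "bad"]]
vocab = build_vocab(sentences)
print([encode_and_pad(s, vocab, max_len=3) for s in sentences])
```

Short sentences are filled with zeros and long ones are truncated, so every sample reaches the network with the same length.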

This paper evaluates the model using accuracy, precision, recall, and F1 score [

True Positive (TP): the number of positive tweets predicted as positive.

False Positive (FP): the number of negative tweets predicted as positive.

True Negative (TN): the number of negative tweets predicted as negative.

False Negative (FN): the number of positive tweets predicted as negative.

Accuracy: correctly predicted tweets divided by the total tweets.

Precision: correctly predicted positive tweets divided by the total predicted positive tweets.

Recall: correctly predicted positive tweets divided by all tweets that are actually positive.

F1: the harmonic mean of precision and recall.
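The four measures can be computed directly from the confusion counts defined above. The sketch below covers the binary (positive versus negative) case and computes F1 as the harmonic mean of precision and recall, its standard definition; the counts in the example call are hypothetical and are not results from this research.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only.
print(classification_metrics(tp=90, fp=10, tn=80, fn=20))
```

For a three-class problem, these quantities would typically be computed per class in a one-versus-rest fashion and then averaged.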

The Salp Swarm Algorithm (SSA) is a swarm-based algorithm that belongs to the family of metaheuristic techniques. Salps in a swarm share similar features and behaviors, for instance, in searching for food, locomotion, and communication. Salps belong to the family Salpidae. They resemble jellyfish: they are barrel-shaped and move by contracting and pumping water through their gelatinous bodies. They feed through internal feeding filters.

SSA, proposed by Mirjalili et al., is a population-based optimization method. Its behavior is modeled on a salp chain foraging for an optimal food source: it improves an initial random solution and converges towards the optimum (the target of the swarm is assumed to be an optimum food source in the search space, called F). In the SSA chain, the salps are either leaders or followers based on their individual position in the chain. The chain starts with a leader who guides the movements of the followers [

where,

S_{2}, S_{3} are uniformly generated random numbers in the interval [0,1].

S_{1} is used to balance exploration and exploitation, with L_{c} being the current iteration number and L_{M} being the maximum number of iterations, as shown in

where the S_{1} parameter is controlled through L_{c} such that the initial steps of the optimization problem are diversified while the final steps are intensified. While the previous equations show the position of the leader, the followers’ position is updated [
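For reference, the update rules of the standard SSA formulation by Mirjalili et al. can be written with this section's symbols as follows. The position symbol $y_j^{i}$ (salp $i$, dimension $j$), the upper and lower bounds $U_j$ and $B_j$, and the threshold of 0.5 on $S_3$ are assumptions taken from the common formulation and its usual implementations; they are not notation confirmed by this text.

```latex
% Leader (first salp) moves around the food source F
y_j^{1} =
\begin{cases}
F_j + S_1\bigl((U_j - B_j)\,S_2 + B_j\bigr), & S_3 \ge 0.5,\\[4pt]
F_j - S_1\bigl((U_j - B_j)\,S_2 + B_j\bigr), & S_3 < 0.5,
\end{cases}

% Coefficient balancing exploration and exploitation
S_1 = 2\,e^{-\left(\frac{4 L_c}{L_M}\right)^{2}},

% Followers move to the midpoint between themselves and their predecessor
y_j^{i} = \tfrac{1}{2}\left(y_j^{i} + y_j^{i-1}\right), \qquad i \ge 2 .
```

Because $S_1$ decays from about 2 towards 0 as $L_c$ approaches $L_M$, the leader takes large steps early in the run and increasingly small steps around $F$ later on.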


The following pseudo-code algorithm explains the main steps of the SSA and how some of the proposed hyperparameters associated with the Bi-LSTM are optimized.
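A minimal Python sketch of these steps is given below, following the standard SSA update rules of Mirjalili et al. rather than any implementation detail confirmed by this text. To stay self-contained, it minimizes a toy sphere function; in the setting described here, the objective would instead train a Bi-LSTM for a candidate set of hyperparameters and return its error. The function names, the clamping to the bounds, and the 0.5 threshold on the third random number are assumptions; the population size of 20 and the 300 iterations mirror the SSA settings reported in this paper.

```python
import math
import random

def salp_swarm_minimize(objective, lower, upper, n_salps=20, max_iter=300, seed=0):
    """Minimize `objective` over the box [lower, upper] with the standard SSA rules."""
    rng = random.Random(seed)
    dim = len(lower)
    # Random initial population inside the bounds.
    salps = [[rng.uniform(lower[j], upper[j]) for j in range(dim)]
             for _ in range(n_salps)]
    food = min(salps, key=objective)[:]      # best solution so far (F)
    food_fit = objective(food)

    for it in range(1, max_iter + 1):
        # S1 shrinks over time: exploration first, exploitation later.
        s1 = 2.0 * math.exp(-((4.0 * it / max_iter) ** 2))
        for i in range(n_salps):
            for j in range(dim):
                if i == 0:
                    # Leader moves around the food source.
                    s2, s3 = rng.random(), rng.random()
                    step = s1 * ((upper[j] - lower[j]) * s2 + lower[j])
                    salps[i][j] = food[j] + step if s3 >= 0.5 else food[j] - step
                else:
                    # Follower moves to the midpoint with its predecessor.
                    salps[i][j] = 0.5 * (salps[i][j] + salps[i - 1][j])
                # Keep the salp inside the search bounds.
                salps[i][j] = min(max(salps[i][j], lower[j]), upper[j])
            fit = objective(salps[i])
            if fit < food_fit:               # store the best fitness found
                food, food_fit = salps[i][:], fit
    return food, food_fit

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best, best_fit = salp_swarm_minimize(sphere, lower=[-5.0, -5.0], upper=[5.0, 5.0])
print(best, best_fit)
```

For hyperparameter tuning, each dimension of a salp would encode one hyperparameter, with integer-valued ones such as the number of hidden neurons rounded before the model is built.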

Classification is an area of AI that takes raw data and assigns it to a specific class depending on the relevant features. Natural Language Processing (NLP) is an application of computational linguistics with which textual content can be analyzed. Sentiment is the inference of the emotions and thoughts behind an individual’s opinion. The opinion is classified as positive, neutral, or negative by a supervised machine learning algorithm. In this research work, Logistic Regression (LR), Support Vector Machine (SVM), and Naive Bayes (NB) classification models are utilized.

Logistic regression is a statistical model used to estimate the probability of a particular class or event. It is the appropriate regression model for binary classification and can be generalized to cover multiclass classification. Like all regression analyses, logistic regression gives insight into datasets and explains the relationship between one dependent binary variable and one or more independent variables that are interval, nominal, ordinal, or ratio-level [
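The way logistic regression turns features into a class probability can be sketched in a few lines. The weights, bias, and feature values below are hypothetical; in practice they would be learned from the training data.

```python
import math

def predict_proba(weights, bias, features):
    """Probability of the positive class under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid squashes z into (0, 1)

def predict(weights, bias, features, threshold=0.5):
    """Binary decision obtained by thresholding the probability."""
    return 1 if predict_proba(weights, bias, features) >= threshold else 0

# Hypothetical model with two features.
print(predict_proba([2.0, -1.0], 0.5, [1.0, 1.0]))
```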

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression. SVMs are mostly used for classification and are considered one of the most widely used classification techniques.

SVMs rely on finding a hyperplane that best partitions a dataset into two classes. Support vectors are the data points nearest to the hyperplane, that is, the points whose removal would change the position of the hyperplane. They can therefore be seen as the essential elements of the dataset [

Bayesian theory is fundamentally a framework for making decisions under uncertainty, that is, a probabilistic approach to prediction. Bayes hypothesized that the likelihood of future events could be determined from their prior frequency.

The advantage of the Bayesian theory is its simplicity. The forecasts depend entirely on the collected data, and the more prior data is available, the better the classifier performs. Another benefit is that Bayesian models are self-adjusting: when the data changes, so does the outcome. One particularly practical Bayesian learning technique is the Naive Bayes classifier. It is based on Bayesian theory and is especially suitable when the dimensionality of the dataset is high [
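A compact sketch of a Naive Bayes text classifier is shown below. It uses the multinomial variant with add-one (Laplace) smoothing, which is an assumption, since this text does not state which variant was used; the four training documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Collect class frequencies, per-class word counts, and the vocabulary."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {t for tokens in docs for t in tokens}
    return class_counts, word_counts, vocab

def predict_nb(tokens, model):
    """Return the class with the highest log posterior for the given tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)            # log prior
        for t in tokens:
            if t in vocab:                               # ignore unseen words
                # Add-one smoothed likelihood of the word given the class.
                score += math.log((word_counts[label][t] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy data, for illustration only.
docs = [["great", "course"], ["loved", "it"], ["boring", "course"], ["bad", "lecture"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
print(predict_nb(["loved", "course"], model), predict_nb(["bad", "boring"], model))
```

The "naive" part is the assumption that words are conditionally independent given the class, which is what allows the per-word log likelihoods to be summed.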

Normally, the farther the data points lie from the hyperplane, the more certain we are that they have been correctly classified. Therefore, we want the data points to be as far from the hyperplane as possible while remaining on the correct side. When new testing data is added, the side of the hyperplane on which it falls determines the class assigned to it [
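The role of the distance to the hyperplane can be made concrete with a short sketch. It assumes a linear hyperplane defined by a weight vector `w` and an offset `b` that have already been learned; the values used in the example are hypothetical.

```python
import math

def signed_distance(w, b, x):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def classify(w, b, x):
    """Assign the class according to the side of the hyperplane x falls on."""
    return 1 if signed_distance(w, b, x) >= 0 else -1

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0.
w, b = [3.0, 4.0], -5.0
print(signed_distance(w, b, [3.0, 4.0]), classify(w, b, [0.0, 0.0]))
```

The sign gives the predicted class, while the magnitude can be read as a rough indication of how confidently the point is classified.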

Recurrent Neural Network (RNN) is an extension of the multilayer perceptron with feedback [

where,

Thus, due to the system’s memory, each step output depends on previous inputs and calculations. Bidirectional LSTM improves the performance of sequence classification. It runs the input in two ways, from past to future and vice versa. This guarantees that information from the past and the future is preserved at any particular moment, which adds additional context to the network and results in faster and better learning of the problem [
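The bidirectional idea can be illustrated with a deliberately tiny sketch: a one-unit LSTM cell with scalar inputs is run once from past to future and once from future to past, and the two hidden states are paired at each time step. The gate weights are hand-picked hypothetical values, not trained parameters, and the sketch is far smaller than the architecture used in this research.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_pass(xs, p):
    """Run a one-unit LSTM over scalars xs; p maps gate name -> (w_x, w_h, bias)."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        def pre(name):
            w_x, w_h, bias = p[name]
            return w_x * x + w_h * h + bias
        i, f, o = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
        g = math.tanh(pre("g"))      # candidate memory content
        c = f * c + i * g            # forget part of the old memory, add new content
        h = o * math.tanh(c)         # hidden state passed to the next step
        outputs.append(h)
    return outputs

def bilstm(xs, p_fwd, p_bwd):
    """Pair forward and backward hidden states for every time step."""
    fwd = lstm_pass(xs, p_fwd)
    bwd = lstm_pass(xs[::-1], p_bwd)[::-1]   # re-align with the input order
    return list(zip(fwd, bwd))

# Hypothetical gate parameters (w_x, w_h, bias).
params = {"i": (0.5, 0.1, 0.0), "f": (0.4, 0.2, 1.0),
          "o": (0.3, 0.1, 0.0), "g": (0.8, 0.3, 0.0)}
print(bilstm([1.0, -2.0, 1.0], params, params))
```

At each position the forward state summarizes everything before it and the backward state everything after it, which is the additional context described above. In a real Bi-LSTM the two directions have separate learned parameters.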

The root mean squared error (RMSE) is used to evaluate the performance of the model:
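In its usual form, assuming $n$ evaluation samples with target values $y_i$ and model outputs $\hat{y}_i$ (symbols introduced here for illustration, not taken from this text), the RMSE reads:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
```

It is the square root of the mean squared error, so it is expressed in the same units as the target values.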

There are many different search strategies to find the hyperparameters of an LSTM, such as exhaustive search, random search, and Bayesian optimization. Each one of them can affect the model performance significantly. For instance, the exhaustive grid search method tries all possible combinations of a discrete subset of the hyperparameter values. While this performs well with few hyperparameters, its cost grows exponentially with the number of hyperparameters. A random search, which randomly samples combinations from the hyperparameter space, is known to often find good solutions in less time. On the other hand, Bayesian optimization uses previous iterations to improve the sampling of the hyperparameters for the next stage. Likewise, metaheuristic methods can yield global or near-optimal solutions for hard bounded optimization problems [
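The difference in cost between grid search and random search can be sketched as follows. The hyperparameter names follow those tuned in this research (hidden neurons, window size, dropout), but the candidate values are hypothetical and no model is trained here; the sketch only enumerates the configurations each strategy would evaluate.

```python
import itertools
import random

# Hypothetical discrete candidate values for each hyperparameter.
SEARCH_SPACE = {
    "hidden_neurons": [32, 64, 128],
    "window_size": [5, 10, 20],
    "dropout": [0.1, 0.3, 0.5],
}

def grid_search(space):
    """Yield every combination; the count is the product of the list lengths."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

def random_search(space, n_trials, seed=0):
    """Yield n_trials configurations sampled independently at random."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        yield {k: rng.choice(v) for k, v in space.items()}

print(len(list(grid_search(SEARCH_SPACE))))          # 3 * 3 * 3 combinations
print(list(random_search(SEARCH_SPACE, n_trials=5)))
```

Adding a fourth hyperparameter with three values would triple the grid to 81 configurations, whereas the random search budget stays at whatever `n_trials` is set to.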

Hyperparameter | Selection |
---|---|
Number of hidden neurons in each layer | Optimized with SSA |
Window size | Optimized with SSA |
Dropout | Optimized with SSA |
Optimizer | SGD |
Loss function | Mean Squared Error |
Number of epochs | 500 |

Education policymakers have very few tools to help them formulate various complex policies for the socio-technical system. Thus, few researchers have modeled the different factors that affect the quality using advanced techniques such as system dynamics [

Objective-type questions are utilized to gather feedback as proposed by Lin et al., through online surveys, and text descriptions [

In this section, three commonly used sentiment analysis techniques, Logistic Regression, Support Vector Classifier, and Naïve Bayes Classifier, are used for a comparative study with the proposed paradigm. The proposed classifier and the other sentiment analysis techniques were applied to the same labeled dataset.

Both the training performance and architecture of the Bidirectional LSTM model are shown in

Logistic regression was applied to classify the sentiments of the tweets using 500 iterations, with its parameters found by gradient descent optimization, and achieved an accuracy of 83%. The second classifier was a support vector classifier with a linear kernel whose hyperparameters were tuned using grid search; it achieved an accuracy of 91%. The last classifier was the Naive Bayes classifier, also tuned using grid search, which achieved 92% accuracy. Among the commonly used techniques, the Naïve Bayes classifier therefore achieved the best classification accuracy.

Classifier | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|
LR | 0.84 | 0.89 | 0.87 | 0.83 |
SVC | 0.86 | 0.82 | 0.84 | 0.91 |
NB | 0.93 | 0.95 | 0.94 | 0.92 |
Bi-LSTM | 0.9907 | 0.9899 | 0.9903 | 0.9907 |

In each iteration of the LSTM hyperparameter optimization stage, the fitness is calculated and compared with the initial fitness, which is specified by the LSTM accuracy, so that the best fitness is obtained and stored. The outcome of the completed optimization process is a new optimized population. The number of LSTM parameters to be optimized in the SSA implementation is shown in

Parameter | Value |
---|---|
SSA population size | 20 |
SSA max iterations | 300 |
S_{2} | [0,1] |
S_{3} | [0,1] |

The proposed classifier using Bi-LSTM has achieved a classification accuracy of 99%, which outperformed all other techniques, while its confusion matrix is shown in

In this section, the students’ feedback sentiment analysis is assessed, and its correlation with the courses’ overall evaluation is investigated as shown in

In statistics, commonly used correlation measures include the Pearson correlation, the Kendall rank correlation, the Spearman correlation, and the point-biserial correlation. In this work, correlation is used to measure the association between the percentage of positive feedback and the course evaluation. The correlation coefficient ranges from −1 to +1, where ±1 represents a perfect association between the variables, and the relationship weakens as the coefficient approaches zero. Positive and negative signs indicate a positive and a negative relationship, respectively.

Course ID | No. of students | % Positive feedback | Course evaluation |
---|---|---|---|
CSE423 | 112 | 45 | 58 |
MUR233 | 164 | 30 | 48 |
CSE422 | 98 | 62 | 74 |
CSE421 | 183 | 71 | 79 |
CSE411 | 106 | 54 | 63 |
ENG231 | 172 | 27 | 42 |
CSE324 | 155 | 81 | 86 |

In statistics, Pearson correlation is widely used to measure the relationship between linearly related variables. It is a normalized measurement of the covariance. Pearson correlation is calculated as follows:

where,

Kendall rank correlation measures the strength of dependency between two variables [

Spearman rank is used to measure the degree of association between two variables. The Spearman rank is the right correlation when the variables are measured on a scale that is at least ordinal [

The point-biserial correlation is conducted with the Pearson correlation formula, except that one of the variables is dichotomous [_{xy} [

where S_{n} is the standard deviation used when data are available for every member of the population:

M_{1} and M_{0} are the mean values of the continuous variable ‘X’ over the data points in group 1 and group 2, respectively.

Further, n_{1} and n_{0} represent the number of data points in group 1 and group 2, respectively, and ‘n’ is the total sample size.
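With these definitions, the point-biserial coefficient in its standard textbook form can be written as below. The symbol $r_{pb}$, the sample values $X_i$, and the overall mean $\bar{X}$ are notation assumed here for illustration.

```latex
r_{pb} = \frac{M_1 - M_0}{S_n}\,\sqrt{\frac{n_1\, n_0}{n^{2}}},
\qquad
S_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^{2}}
```

This is algebraically the Pearson coefficient computed with the dichotomous variable coded as 0 and 1.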

The correlation coefficient is used to measure the relationship between two datasets. The p-value represents the probability of uncorrelated datasets correlating at least as high as the correlation calculated from these datasets.

Correlation | Corr. coeff. | p-value |
---|---|---|
Pearson | 0.994 | 4.57e-06 |
Kendall rank | 1 | 0.0003968 |
Spearman | 1 | 0 |
Point biserial | 0.994 | 4.57e-06 |
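The Pearson, Spearman, and Kendall coefficients can be re-derived from the seven courses listed in the course table with a few lines of standard-library Python. The helper functions are written for this illustration and assume there are no tied values, which holds for this data; p-values are not computed. On this data the script should reproduce a Pearson coefficient of about 0.994 and rank coefficients of 1, because the two columns are ordered identically.

```python
import math

# Values copied from the course table, in the same row order.
positive_pct = [45, 30, 62, 71, 54, 27, 81]   # % positive feedback
evaluation = [58, 48, 74, 79, 63, 42, 86]     # course evaluation

def pearson(x, y):
    """Covariance normalized by the product of the standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks from 1 to n; assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """(concordant - discordant) pairs divided by all pairs; no tie handling."""
    n, score = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)

print(round(pearson(positive_pct, evaluation), 3))
print(spearman(positive_pct, evaluation), kendall_tau(positive_pct, evaluation))
```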

Since online education has been trending for the past few years and its use surged due to COVID-19, students’ feedback has become more important than ever, and educational organizations have focused on improving their services based on students’ opinions.

This research applied an automated methodology to extract suitable features from tweets and classify them using several machine learning techniques. The features were normalized to achieve better performance. Four classifiers were applied in a comparative study, in which the SSA-optimized Bi-LSTM classifier achieved the best classification performance. This best-performing classifier was then applied to students’ feedback to assess different courses taught at Mansoura University under the hybrid learning scheme during the COVID-19 pandemic. The correlation between the percentage of positive sentiment for each course and its course evaluation was statistically evaluated, and the two were found to be highly correlated. This in turn can be used to adjust the strategies needed to achieve the best hybrid learning services to be adopted in the Egyptian education system.