Movies are a major source of entertainment, and a large number of them are released every year. After watching a movie, people share their opinions in the form of reviews. Because reading every review for a movie is impractical, a summary of all the reviews helps viewers decide whether to watch it without spending time on each one. Opinion mining, also known as sentiment analysis, is the process of identifying and extracting subjective information from textual data, such as the opinions of individuals, which can be positive, neutral, or negative. Here it is applied to understand the emotions and attitudes expressed in movie reviews, which are an important source of opinion data because they reflect the general public’s view of a particular movie. This study compares baseline techniques (Logistic Regression, Random Forest Classifier, Decision Tree, K-Nearest Neighbor, Gradient Boosting Classifier, and Passive Aggressive Classifier) with Linear Support Vector Machines and Multinomial Naïve Bayes on the IMDB Dataset of 50K reviews and the Sentiment Polarity Dataset Version 2.0. In pre-processing, both datasets are cleaned, duplicate data is dropped, and chat words are treated for better results. On the IMDB Dataset of 50K reviews, Linear Support Vector Machines achieve the highest accuracy of 89.48% before hyperparameter tuning, and the Passive Aggressive Classifier achieves the highest accuracy of 90.27% after tuning. On the Sentiment Polarity Dataset Version 2.0, Multinomial Naïve Bayes achieves the highest accuracy, 70.69% before and 71.04% after hyperparameter tuning.
This study highlights the importance of sentiment analysis as a tool for understanding the emotions and attitudes in movie reviews and for predicting the performance of a movie from the average sentiment of all its reviews.

Every year, a large number of movies are released, and this number has increased in recent years as the movie industry produces more and more films. Whether it is a new release or an old classic, there is always something to enjoy. People watch movies to feel good and escape from reality. Movies make them feel happy, sad, scared, and many other emotions, and that is what makes movies so special and fun to watch over and over again.

Movie reviews are comments expressed by people who have seen the movie. All of these reviews determine if the movie is worth watching or not [

Opinion mining is the process of identifying and extracting subjective information from source materials using text analysis, natural-language-processing, and computational linguistics techniques [

Sentiment Analysis involves analyzing the opinion, emotions, attitudes, and feelings expressed by individuals towards entities such as events, services, products, organizations, and their features [

This section provides an overview of the ongoing research work being done in the field of Opinion Mining or Sentiment Analysis.

Ullah et al. [

Bodapati et al. [

Rahman et al. [

Chakraborty et al. [

Baid et al. [

Manek et al. [

Machine learning offers a large number of algorithms. For this research study, Logistic Regression (LR), Multinomial Naïve Bayes (MNB), Linear Support Vector Machine (LSVM), K-Nearest Neighbor (KNN), Passive Aggressive Classifier (PAC), Decision Tree (DT), Random Forest Classifier (RFC), and Gradient Boosting Classifier (GBC) are chosen. The suggested method’s systematic flow is represented in

After data collection, pre-processing is performed. Pre-processing refers to the process of preparing a dataset before the use of any algorithm [

In data cleaning, HTML tags, links, punctuation, duplicate entries, and stop words are removed from both datasets.
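As a minimal sketch of this cleaning step, the functions below strip HTML tags, links, and punctuation, drop stop words, and remove duplicate reviews. The regular expressions, the tiny stop-word set, and the names `clean_review` and `clean_corpus` are illustrative assumptions, not the code used in the study.

```python
import re
import string

# Tiny illustrative stop-word set (assumption); a full list would be used in practice.
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "in", "it", "this"}

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <br />
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text: str) -> str:
    """Remove HTML tags, links, punctuation, and stop words from one review."""
    text = TAG_RE.sub(" ", text)      # tags first, so a URL cannot swallow a tag
    text = URL_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    return " ".join(tokens)


def clean_corpus(reviews):
    """Clean every review and drop exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for review in reviews:
        result = clean_review(review)
        if result not in seen:
            seen.add(result)
            cleaned.append(result)
    return cleaned
```

For example, `clean_corpus(["Good film!", "Good film!", "Bad film."])` keeps only one copy of the repeated review.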

Stemming is a method for reducing words to their root or base form in natural language processing. The stem of the word “jumping,” for example, is “jump,” and the stem of the word “jumps” is also “jump.” This is done to group related words together and reduce the dimensionality of the data, which can help some machine learning algorithms perform better. Snowball stemming is used in this experiment [

Case normalization is also a pre-processing step that is used to convert all the text in a dataset to a consistent case. This is often done to ensure that the text is in a standardized format and to prevent issues that can arise from variations in case, so all the data is converted into lowercase [

One common pre-processing step is to treat chat words, that is, informal abbreviations commonly used in online chat or text messaging. The TextBlob library in Python provides several methods for text pre-processing, including a method for dealing with chat words. Chat word treatment expands abbreviated forms such as “W8” and “ASAP” into their full forms.
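The expansion itself can be sketched as a plain dictionary lookup. The mapping `CHAT_WORDS` and the function `expand_chat_words` below are hypothetical and written only for illustration; they do not reproduce any particular library's API or the abbreviation list used in the study.

```python
# Hypothetical abbreviation table; a real list would be far longer.
CHAT_WORDS = {
    "W8": "wait",
    "ASAP": "as soon as possible",
    "IMO": "in my opinion",
    "BTW": "by the way",
}


def expand_chat_words(text: str) -> str:
    """Replace each whitespace-separated token that is a known chat word."""
    expanded = []
    for token in text.split():
        # Look up the token without trailing punctuation, case-insensitively.
        core = token.rstrip(".,!?")
        tail = token[len(core):]
        replacement = CHAT_WORDS.get(core.upper())
        expanded.append(replacement + tail if replacement else token)
    return " ".join(expanded)
```

Trailing punctuation is preserved, so `"ASAP!"` becomes `"as soon as possible!"`.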

After pre-processing, feature extraction is performed on the dataset. Feature extraction is the process of creating new features or variables from existing ones for use as input to machine learning models. Its main purpose is to create a set of features that represent the important patterns found in the data and are useful for the task at hand [

In text classification projects, TF-IDF is a common measure. It combines two scores: term frequency and inverse document frequency. Term frequency is obtained by counting how often a specific term appears in a document. Inverse document frequency is obtained by dividing the total number of documents by the number of documents in which the term appears and taking the logarithm of the result. Combining the two gives a weight that is higher for terms that appear frequently in a small number of documents and lower for terms that appear in all documents, which makes it possible to find the most important terms in a document [
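To make the weighting concrete, here is a small sketch that computes TF-IDF by hand for tokenized documents. It assumes the unsmoothed variant with tf as a relative count and idf as log(N / df); library implementations often add smoothing and normalization, so their numbers will differ. The function name `tf_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / document length and idf = log(N / df), one common
    variant (an assumption; libraries often add smoothing).
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "great plot great cast".split(),
    "boring plot".split(),
    "great fun".split(),
]
scores = tf_idf(docs)
```

In the second document, "boring" occurs in only one of the three documents and therefore receives a higher weight than "plot", which occurs in two.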

TF-IDF has performed better than many other feature extraction techniques in various cases. It is widely used for its simplicity, interpretability, and efficiency, and as a baseline for fair comparison, because it weighs the importance of words in a document based on their frequency within that document and their rarity across a larger collection of documents. This helps to filter out common words that are less informative and to emphasize more meaningful and context-specific terms. TF-IDF can be calculated by:
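A common unsmoothed formulation, assumed here because several variants exist, is:

```latex
\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{TF-IDF}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
\]
```

Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the total number of documents, and $n_t$ is the number of documents that contain $t$.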

The well-known machine learning approaches implemented in this research study are the Random Forest Classifier (RFC), the Gradient Boosting Classifier (GBC), Multinomial Naïve Bayes (MNB), K-Nearest Neighbor (KNN), Linear Support Vector Machine (LSVM), the Passive Aggressive Classifier (PAC), Decision Tree (DT), and Logistic Regression (LR). Additional details can be found below.

The random forest classifier (RFC), which was first proposed by Breiman in 2001 [

Grid search is used to find the best hyperparameters for the random forest classifier. The parameter grid specifies different values to try for the number of estimators, maximum depth, minimum samples per split, and minimum samples per leaf. The grid search uses 5-fold cross-validation to evaluate each parameter combination with the accuracy metric. After the grid search is fitted on the training data, the best parameters and the corresponding best score are retrieved.

The Naïve Bayes algorithm categorizes data based on probabilities and scales well, even to millions of records. It classifies data by applying Bayes’ theorem to class-conditional probabilities, and the class with the highest posterior probability is the predicted class; this decision rule is known as maximum a posteriori (MAP). Naïve Bayes has advantages and disadvantages in different fields; among its advantages, the algorithm is quick and highly scalable. It is applied to both binary and multiclass classification. It can also be applied to small datasets, producing good results [

The Naïve Bayes algorithm has several variations, including Multinomial Naïve Bayes, Bernoulli Naïve Bayes, Gaussian Naïve Bayes, Complement Naïve Bayes, and Categorical Naïve Bayes. In this experiment, Multinomial Naïve Bayes is used to analyze sentiments in movie reviews. Grid search is used to find the best value for the smoothing parameter of the Multinomial Naïve Bayes classifier. The parameter grid specifies the range of alpha values to be tested. The grid search uses 5-fold cross-validation with accuracy as the scoring metric, and it fits the model to the training data in parallel, evaluating every candidate value. The resulting grid search object gives access to the best hyperparameter value and other useful information.
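To show what the smoothing parameter alpha does, the following is a minimal from-scratch sketch of Multinomial Naïve Bayes over token lists. The class `TinyMultinomialNB` and the four toy reviews are assumptions made for illustration; the study itself relies on a library implementation tuned by grid search.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists with additive smoothing alpha."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # must be > 0 so unseen class/token pairs get mass

    def fit(self, documents, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in documents for tok in doc}
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for doc, label in zip(documents, labels):
            self.token_counts[label].update(doc)
        return self

    def log_posterior(self, doc, label):
        # log P(class) + sum over tokens of log P(token | class), smoothed.
        total = sum(self.token_counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for tok in doc:
            if tok in self.vocab:  # ignore tokens never seen in training
                score += math.log((self.token_counts[label][tok] + self.alpha) / denom)
        return score

    def predict(self, doc):
        # Maximum a posteriori decision: pick the class with the highest score.
        return max(self.classes, key=lambda c: self.log_posterior(doc, c))


train_docs = [
    ["great", "acting"],
    ["wonderful", "great", "story"],
    ["boring", "plot"],
    ["terrible", "boring", "acting"],
]
train_labels = ["pos", "pos", "neg", "neg"]
model = TinyMultinomialNB(alpha=1.0).fit(train_docs, train_labels)
```

With alpha = 1, a token that never occurred in a class still receives a small non-zero probability, which keeps the logarithm defined.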

The K-Nearest Neighbor (KNN) algorithm is a machine learning technique that is simple to implement but highly effective. It can be used for both classification and regression, although its most common application is classification. KNN makes predictions for new data based on its similarity to the stored training data: an input is assigned to the class that is most common among its nearest neighbors [

A grid search with cross-validation is used to find the best value for the number of neighbors (k) in the K-Nearest Neighbor (KNN) classifier. The values tested are 3, 5, 7, and 9. The grid search evaluates the performance of the KNN classifier using 5-fold cross-validation and the accuracy scoring metric. The goal is to find the value of k that results in the highest accuracy for the classifier.
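The role of k can be illustrated with a short sketch of the majority vote. The toy two-dimensional points stand in for TF-IDF vectors, and Euclidean distance is an assumption; the function name `knn_predict` is illustrative.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


points = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.1), (0.1, 0.9), (0.2, 0.8), (0.1, 0.7)]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
```

Every prediction scans all stored points, which is why prediction cost grows with the size of the training set.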

Linear Support Vector Machines (LSVMs) are a supervised learning method that can be used for both classification and regression. SVMs separate the classes with a hyperplane. They remain effective as the number of dimensions grows, including when the number of dimensions is larger than the number of samples [

Grid search is used to find the best hyperparameters for the Linear Support Vector Machine (LSVM) classifier. The hyperparameters being tuned are the regularization parameter ‘C’ and the ‘penalty’ term. The grid search is executed using 5-fold cross-validation. The ‘dual’ parameter is set to False so that the primal optimization problem is solved, which is the more efficient choice when there are more samples than features. The best hyperparameters are those with the highest accuracy achieved during cross-validation.
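A search of this kind could look roughly as follows with scikit-learn. This is a sketch under stated assumptions: the candidate values for `C` and `penalty`, the pipeline step names, and the twenty toy reviews are hypothetical, since the actual grid and code of the study are not given.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (assumption): ten positive and ten negative one-line reviews.
positive = [
    "great film with a wonderful cast",
    "loved the story and the acting",
    "a brilliant and moving picture",
    "wonderful direction and great music",
    "the acting was superb and moving",
    "great fun from start to finish",
    "a wonderful story told brilliantly",
    "superb cast and a great script",
    "loved every minute of this film",
    "brilliant acting and a moving ending",
]
negative = [
    "boring film with a terrible script",
    "hated the plot and the acting",
    "a dull and tedious picture",
    "terrible direction and awful music",
    "the acting was poor and dull",
    "a waste of time from start to finish",
    "an awful story told badly",
    "poor cast and a boring script",
    "hated every minute of this film",
    "tedious pacing and a terrible ending",
]
texts = positive + negative
labels = [1] * len(positive) + [0] * len(negative)

# TF-IDF features feed a linear SVM; dual=False solves the primal problem.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(dual=False)),
])

# Hypothetical candidate values for the two tuned hyperparameters.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```

Placing the vectorizer inside the pipeline means the TF-IDF statistics are recomputed on each training fold, so the held-out fold does not influence the features.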

A Decision Tree is a supervised machine learning model used for regression and classification. It is a structured representation of the mapping from an instance’s attributes to its class. A tree is either a single leaf node assigned to a specific category or a larger structure consisting of a root node connected to two or more subtrees. Each internal node is a test on the instance’s attribute values, and each possible outcome of the test is associated with one subtree. Classification starts at the root node, which initially represents the entire training data. If the node is a test, the outcome for the instance is determined and the process continues with the corresponding subtree. When a leaf is reached, its label gives the predicted class of the instance.

A decision tree can be built from a set of cases using a “divide and conquer” approach. When all the cases belong to the same class, the tree is a leaf labeled with that class. Otherwise, a test is selected that yields different outcomes for at least two of the cases, and the cases are partitioned according to the outcome. The test becomes the root node of the tree, and applying the same procedure to the subset of cases with each outcome yields the corresponding subtree [

Logistic regression is a supervised machine learning algorithm that was created to help with classification problems. When the target variable is categorical, the problem is referred to as a classification learning problem. The goal of logistic regression is to predict the probability that a new example belongs to one of the target classes by mapping a function from the dataset’s features to the targets [

This algorithm is sometimes also called Maximum Entropy classification. Logistic regression belongs to the family of generalized linear models and is used to model the probabilities of the possible outcomes of a trial [
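The probability that logistic regression assigns to the positive class can be sketched in a few lines. The weights, bias, and feature values below are hypothetical; in practice the weights are learned from the training data.

```python
import math


def predict_positive_probability(weights, bias, features):
    """Return P(class = positive | features) under a logistic model."""
    # Linear score followed by the logistic (sigmoid) function.
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for two features, e.g. TF-IDF of "great" and "boring".
weights = [2.0, -2.0]
p_pos = predict_positive_probability(weights, 0.0, [0.8, 0.1])
p_neg = predict_positive_probability(weights, 0.0, [0.1, 0.8])
```

A review is labeled positive when this probability exceeds 0.5, which corresponds to a linear score above zero.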

Gradient Boosting Classifiers (GBCs) are increasingly popular for learning non-linear classification and regression models quickly and effectively. The model is an ensemble of weak prediction models, such as regression trees, built by gradually introducing new learners. Each tree consists of a series of nodes and leaves, and each new learner uses the results of the previous ones to improve the prediction. When improved collectively in this way, regression trees outperform their individual counterparts [

The Passive Aggressive Classifier (PAC) is an important classifier in the family of online learning algorithms. Its classification function is updated when a newly seen example is misclassified or when its classification score does not exceed a certain margin. PAC has been shown to be a useful and popular approach to online learning for real-world problems. It is suited to data that arrives continuously, for example daily or weekly, such as news and social media. The main idea is that the algorithm looks at an example, learns from it, and then discards it without having to store it. When a mistake is made, the algorithm reacts aggressively by changing its weights; when no mistake is made, it stays passive and leaves the model unchanged. Hence the name passive aggressive classifier [
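The passive and aggressive cases can be sketched as a single update step. The code assumes the PA-I variant with hinge loss, labels encoded as +1 and -1, and an aggressiveness cap `C`; whether the study's implementation uses exactly this variant is not stated, and the name `pa_update` is illustrative.

```python
def pa_update(weights, features, label, C=1.0):
    """One Passive-Aggressive (PA-I) step; `label` must be +1 or -1."""
    margin = label * sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - margin)            # hinge loss
    if loss == 0.0:
        return list(weights)                 # passive: margin is large enough
    norm_sq = sum(x * x for x in features)
    if norm_sq == 0.0:
        return list(weights)                 # nothing to learn from a zero vector
    tau = min(C, loss / norm_sq)             # aggressive step size, capped by C
    return [w + tau * label * x for w, x in zip(weights, features)]


w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], label=+1)     # wrong side of the margin: update
w2 = pa_update(w1, [1.0, 0.0], label=+1)     # now correct with margin: no change
```

Each example is used once and can then be discarded, which matches the online setting described above.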

A grid search is used to find the optimal hyperparameters for the Passive Aggressive Classifier. The parameter grid contains different values for the regularization parameter, the fit-intercept option, and the maximum number of iterations. The grid search is configured with settings for the scoring metric, 5-fold cross-validation, and parallel processing. The goal is to identify the hyperparameter combination that gives the highest accuracy for the model.

Hyperparameters are parameters that are not learned from the data during training but are set manually before starting the training process. They represent higher-level configuration choices for the machine learning algorithm, and their values are typically based on the characteristics of the data being used and the algorithm’s ability to learn from that data. These hyperparameters serve as instructions or constraints for the learning algorithm, controlling its behavior and performance. Examples of common hyperparameters include the learning rate, regularization strength, depth of a decision tree, and the type of kernel used in a Support Vector Machine.

The selection of appropriate hyperparameters is important as they can greatly impact the model’s ability to generalize and make accurate predictions. Hyperparameter tuning involves systematically exploring different combinations of hyperparameter values to find the optimal settings that maximize the model’s performance on a validation set or through cross-validation [

One of the most popular techniques for exploring the hyper-parameter configuration space is grid search (GS). It evaluates all the hyper-parameter combinations provided to the grid of configurations and can be viewed as an exhaustive search or brute-force technique. The way it operates is by analyzing the Cartesian product of a user-specified finite set of values [
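The exhaustive nature of grid search can be sketched with the Cartesian product from the standard library. The grid values and the stand-in scoring function `toy_score` are hypothetical; in a real experiment the score would be the mean cross-validated accuracy of a model trained with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in the Cartesian product of the grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid: 3 x 2 = 6 combinations are evaluated.
grid = {"C": [0.01, 0.1, 1.0], "max_iter": [100, 1000]}


def toy_score(params):
    # Stand-in for cross-validated accuracy; peaks at C=0.1, max_iter=1000.
    return -abs(params["C"] - 0.1) + (0.01 if params["max_iter"] == 1000 else 0.0)


best_params, best_score = grid_search(grid, toy_score)
```

The number of evaluations is the product of the list lengths, so the cost grows quickly as more hyperparameters or values are added.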

The experiment was run on a computer with Windows 10 Pro, a 10th-generation Core i5 processor, and 16 GB of RAM. It was performed in JupyterLab, an open-source, web-based interactive development environment (IDE) for scientific computing, which was accessed through Anaconda. Anaconda is a Python and R distribution for fields such as data science, machine learning, and scientific computing. It provides a convenient environment for managing packages and dependencies and for running code. Launching JupyterLab from Anaconda makes the necessary libraries and packages available, which simplifies experiment setup and execution.

To compare these machine learning algorithms, two movie review datasets are used: the IMDB Dataset of 50K Movie Reviews and the Sentiment Polarity Dataset Version 2.0.

This is a binary sentiment classification dataset that has a significantly increased amount of data in comparison to earlier benchmark datasets. There are 25,000 extremely polar movie reviews that can be used for training, and there are 25,000 that can be used for testing [

The Sentiment Polarity Dataset Version 2.0 was created by Lillian Lee and Bo Pang and is redistributed with NLTK with the authors’ permission. This dataset consists of 63K reviews, half of which are positive and half negative [

The models used in this study were evaluated using training time, precision, recall, F1-score, and accuracy [

Accuracy is determined by dividing the number of correctly predicted reviews by the total number of reviews.

Precision is determined by dividing the number of reviews correctly predicted as positive by the total number of reviews predicted to be positive.

Recall is calculated by dividing the number of reviews correctly predicted as positive by the total number of reviews that are actually positive.

The F1-score is the harmonic mean of precision and recall.
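These four definitions can be written out directly from the counts of true and false positives and negatives. The function name `classification_metrics` and the eight toy labels are illustrative assumptions.

```python
def classification_metrics(y_true, y_pred, positive="pos"):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "pos", "neg", "pos", "pos", "neg", "neg"]
metrics = classification_metrics(y_true, y_pred)
```

In this toy case there are 3 true positives, 2 false positives, and 1 false negative, so precision (3/5) and recall (3/4) differ and the F1-score lies between them.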

In this experiment, the results are compared in two parts. The results of the models on the IMDB Dataset are compared with each other in

Similar to the first part, the second part compares the results of the models on the Sentiment Polarity Dataset v2 in

Models | Precision | Recall | F1-score | Accuracy | Training time |
---|---|---|---|---|---|
Linear Support Vector Machine | 90.16% | | | | 0.81 s |
Logistic Regression | 87.99% | 89.4% | 89.31% | | 6.18 s |
Passive Aggressive Classifier | 86.63% | 87.63% | 87.13% | 87.14% | 0.42 s |
Multinomial Naïve Bayes | 86.13% | 85.99% | 86.06% | 86.17% | 0.09 s |
Random Forest Classifier | 84.27% | 85.53% | 84.90% | 84.89% | 208.67 s |
Gradient Boosting Classifier | 77.67% | 86.90% | 82.03% | 81.09% | 774.43 s |
K-Nearest Neighbor | 74.94% | 84.4% | 79.39% | 78.25% | 29.00 s |
Decision Tree | 71.07% | 71.79% | 71.43% | 71.49% | 63.99 s |

Models | Precision | Recall | F1-score | Accuracy | Training time |
---|---|---|---|---|---|
Linear Support Vector Machine | 88.98% | 91.06% | 90% | 89.96% | 43.99 s |
Logistic Regression | 89% | 91.03% | 90% | 89.96% | 1140.20 s |
Passive Aggressive Classifier | | | | | 37.72 s |
Multinomial Naïve Bayes | 85.83% | 87.34% | 86.58% | 86.56% | 0.611 s |
Random Forest Classifier | 84.26% | 87.52% | 85.86% | 85.68% | 8374.85 s |
Gradient Boosting Classifier | 82.58% | 86.89% | 82.68% | 84.21% | 14024.37 s |
K-Nearest Neighbor | 74.28% | 85.64% | 79.56% | 77.92% | 4733.60 s |
Decision Tree | 75.03% | 74.41% | 74.22% | 74.36% | 5540.41 s |

Models | Precision | Recall | F1-score | Accuracy | Training time |
---|---|---|---|---|---|
Linear Support Vector Machine | 69.35% | 69.75% | 69.16% | | 0.34 s |
Logistic Regression | 69.56% | 70.99% | 70.27% | 69.20% | 1.01 s |
Passive Aggressive Classifier | 66.45% | 68.99% | 67.70% | 66.24% | 0.48 s |
Multinomial Naïve Bayes | 70% | 74.99% | | | 0.04 s |
Random Forest Classifier | 65.16% | 65.18% | 65.17% | 64.28% | 247.57 s |
Gradient Boosting Classifier | 55.92% | 67.91% | 58.11% | | 93.80 s |
K-Nearest Neighbor | 52.62% | 45.37% | 48.73% | 51.04% | 13.98 s |
Decision Tree | 58.77% | 58.6% | 58.69% | 57.7% | 25.42 s |

Models | Precision | Recall | F1-score | Accuracy | Training time |
---|---|---|---|---|---|
Linear Support Vector Machine | 70.39% | 70.07% | 70.23% | 69.54% | 43.49 s |
Logistic Regression | 70.07% | 70.27% | 70.17% | 69.37% | 24.05 s |
Passive Aggressive Classifier | 70.36% | 70.88% | 70.62% | 69.76% | 35.2 s |
Multinomial Naïve Bayes | 70.22% | | | | 0.57 s |
Random Forest Classifier | 63.49% | 71.32% | 67.18% | 64.27% | 7128.54 s |
Gradient Boosting Classifier | 58.03% | 81.36% | 67.74% | 60.76% | 2058.31 s |
K-Nearest Neighbor | 51.51% | 67.52% | 52.64% | | 1474.46 s |
Decision Tree | 58.15% | 58.12% | 58.14% | 58.13% | 2351.60 s |

For the IMDB Dataset of 50K reviews, a train-test split was performed with a test size of 25% and a training size of 75%. Among the models applied, Linear Support Vector Machines achieved the highest accuracy, 89.48%, without hyperparameter tuning. The results are shown in

The performance of the models on the IMDB Dataset is visualized in

The hyperparameter tuning results of the models used in this study are shown in

After grid search and hyperparameter tuning, the Passive Aggressive Classifier achieved higher accuracy than Logistic Regression and Linear SVM. A likely explanation is that its more flexible and adaptable learning better captured complex patterns and relationships in the IMDB dataset.

On the other hand, K-Nearest Neighbor and Decision Tree achieved the lowest accuracy of the models used in this study. The lower accuracy of K-Nearest Neighbor (KNN) can be attributed to its sensitivity to irrelevant or noisy features in the dataset; its computational cost also increases with larger datasets.

The lower accuracy of the Decision Tree model is attributed to its tendency to overfit the training data with complex decision boundaries, which reduces generalization, and to its high variance and sensitivity to small changes, which lead to instability and lower accuracy on the IMDB dataset. The comparison of model accuracies with and without hyperparameter tuning is shown in

For the Sentiment Polarity Dataset, a train-test split was performed with a test size of 25% and a training size of 75%. Among the machine learning techniques applied, Multinomial Naïve Bayes gives the maximum accuracy of 70.69%. The other results are given in

The performance of the models on the Sentiment Polarity Dataset is visualized in

Among the models compared on the Sentiment Polarity Dataset Version 2.0, Multinomial Naïve Bayes (MNB) achieved the best accuracy, 70.69%. Multinomial Naïve Bayes is known for its ability to handle larger datasets efficiently. The assumption of conditional independence between features (words) given the class allows MNB to scale well with the number of features, making it suitable for text classification tasks with a high number of words or features [

On the other hand, Decision Tree (DT) and K-Nearest Neighbor (KNN) performed the worst, potentially due to DT’s tendency to create complex decision boundaries and overfit the training data, as well as KNN’s sensitivity to irrelevant features and challenges in handling high-dimensional text data.

The hyperparameter tuning results of the models on the Sentiment Polarity Dataset Version 2 are shown in

The comparison of model accuracies with and without hyperparameter tuning is shown in

According to the results of hyperparameter tuning on the Sentiment Polarity Dataset Version 2, Multinomial Naïve Bayes leads with a maximum accuracy of 71.04%, while the Passive Aggressive Classifier once more outperformed Linear SVM and Logistic Regression. In terms of recall, K-Nearest Neighbor and Decision Tree outperformed other models due to their ability to identify a greater proportion of positive or negative sentiment instances.

Opinion mining, also called sentiment analysis, is the extraction of subjective information from textual data. This paper compares Linear Support Vector Machines and Multinomial Naïve Bayes with baseline machine learning algorithms for extracting sentiment from movie reviews. On the IMDB movie reviews dataset of 50,000 reviews, Linear Support Vector Machines achieved a maximum accuracy of 89.48% before hyperparameter tuning. After hyperparameter tuning, both Linear Support Vector Machines and Logistic Regression improved their accuracy to 89.96%, but the Passive Aggressive Classifier surpassed them with a maximum accuracy of 90.27%. On the Sentiment Polarity Dataset Version 2.0, Multinomial Naïve Bayes emerged as the leading algorithm with a maximum accuracy of 71.04%. In this study, Linear Support Vector Machines and Logistic Regression performed better on high-dimensional textual data, and Multinomial Naïve Bayes performed better on larger datasets than the other classifiers. Naïve Bayes, the Passive Aggressive Classifier, Logistic Regression, and LSVM, which achieved the highest accuracy, can serve as benchmark models for future research in sentiment analysis. Our future work will focus on two main directions: first, a comprehensive comparative analysis of the latest techniques in sentiment analysis for movie reviews, and second, exploring deep learning models to further enhance sentiment analysis in this domain.

We would like to express our sincere gratitude to all the individuals and organizations who have contributed to the successful completion of this research paper. We also acknowledge the contribution of our participants, who provided valuable feedback and suggestions that helped to refine this research methodology and enhance the quality of the findings. We would like to thank the participants who generously gave their time and participated in our study, without whom this research would not have been possible. Thank you all for your contributions to this research paper.

The authors received no specific funding for this study.

The authors confirm contribution to the paper as follows: study conception and design: SSK and BK; data collection: MMD, MK and MBG; analysis and interpretation of results: MMD, SSK, MA and BK; draft manuscript preparation: MMD and SSK. All authors reviewed the results and approved the final version of the manuscript.

The datasets analyzed during the current study are available from the corresponding author upon reasonable request.

The authors declare that they have no conflicts of interest to report regarding the present study.