COVID-19 has claimed many lives, so vaccination is crucial to limit the spread of the disease. However, no vaccine is perfect or works equally well for everyone. In the present work, we analyze data from the Vaccine Adverse Event Reporting System (VAERS) and find that the outcome of vaccination can depend on demographic factors such as age and gender, and on other variables such as the state of residence. The state of residence is considered because it serves as a proxy for variables that are not recorded, such as food habits and living conditions. The target audience for this work is healthcare workers, government bodies, and medical research organizations. We analyze the data using machine learning techniques and predict the effect on specific age groups of COVID-19 vaccines developed by two major manufacturers, PFIZER\BIONTECH and MODERNA. Data visualization and analysis are used to interpret the vaccine impact in terms of the variables mentioned above. The results suggest that people in a given demographic group could choose a vaccine based on how a particular manufacturer's vaccine has previously performed for that group. The machine learning algorithms we use are Logistic Regression, AdaBoost, Decision Tree, and Random Forest. We take the DIED variable as the target variable because it represents the most severe threat to life. In terms of performance measures, AdaBoost gives the best values, so the prediction of the type of vaccine to be administered could be derived using this algorithm. The accuracies achieved in our experiments are as follows: Decision Tree Classifier 97.3%, Logistic Regression 97.31%, Random Forest 97.8%, and AdaBoost 98.1%.

In today's world, COVID-19 has become the main concern, and we are trying our best to fight this pandemic. Researchers and clinicians worked hard for over a year to develop a vaccine for this disease. Now that it is here, the main task is to get it to the people. However, researchers say that some people might be negatively affected by the vaccine, and cases of such negative effects have been reported around the world. Certain pre-existing health conditions or symptoms can lead to a severe reaction to the COVID-19 vaccine, and in the worst case to the death of the candidate. It is therefore essential to know the prior health condition of the candidate.

The main objective of this work is to develop a system that uses various machine learning (ML) algorithms to predict whether a candidate's life is in danger once he/she takes the COVID-19 vaccine. We have used regression analysis in this work. Regression is used when one seeks to predict a numerical quantity. A real-world dataset helps in training the system and predicting the result more accurately, so we have taken up a dataset on adverse effects after vaccination. The data describes vaccines and their adverse effects on individuals living in the United States. We analyze personal data, i.e., demographic factors (age and gender), and combine it with a geographic factor (the state in which the candidate lives) to capture living habits. We prioritize the symptoms and group them into categories such as normal, critical, and life-threatening. It is still not certain how effective these vaccines are. We predict for which age groups a specific manufacturer's COVID-19 vaccine is likely to show adverse effects, so that vaccination of that age group with that vaccine can be withheld until the vaccine is entirely successful.

Additionally, we have utilized the AdaBoost algorithm, which reassigns weights to each instance to help reduce bias and variance and improve accuracy. AdaBoost helped us to obtain better results. This research aims to improve the understanding of the adverse effects of different vaccines with the help of four chosen ML algorithms. We propose methods that help reduce bias and variance while maintaining high accuracy when classifying a candidate as suitable for vaccination. The proposed system was trained using the VAERS dataset, which contains 4716 rows and 18 columns with data on two different vaccines. The rest of the paper is structured as follows. Section 2 is the literature review, where we analyze how the algorithms we use have been applied in different scenarios. Section 3 describes the proposed system: its architecture, its modules in detail, and the dataset used. Section 4 presents the results and discussion, where we report the metrics achieved by our algorithms and analyze the results and graphs. Section 5 is the conclusion, where we summarize the findings of our study and discuss future work.

Since the technology is rapidly increasing, there are many chances and possibilities for ML in the field of healthcare [

Shipe et al. [

Souza et al. [

LR is a classification approach that uses supervised learning to predict the probability of a target variable. The target, or dependent variable, is dichotomous, so there are only two classes. In simple terms, the target variable is binary, with the information presented as 1 (or yes) or 0 (or no). LR uses an S-shaped (sigmoid) curve that maps any real-valued number to a value between 0 and 1, never reaching exactly 0 or 1. Since our work is about a prediction with only two outputs, either yes (1) or no (0), logistic regression is well suited here. It predicts whether the person is at risk of death after taking the vaccine based on that person's symptoms [
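As a small illustration of the S-shaped mapping described above, the following sketch evaluates the logistic (sigmoid) function and thresholds its output. The function names and the 0.5 threshold are illustrative assumptions, not taken from the paper's implementation.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score z to a value strictly between 0 and 1."""
    # Two branches keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_label(z: float, threshold: float = 0.5) -> int:
    """Turn the probability into a class: 1 (yes) or 0 (no).

    The 0.5 threshold is an assumed default, not a value from the paper.
    """
    return 1 if sigmoid(z) >= threshold else 0


for score in (-4.0, 0.0, 4.0):
    print(score, round(sigmoid(score), 4), predict_label(score))
```

In a fitted logistic regression model the score z would be a weighted sum of the input features; here it is passed in directly to keep the example minimal.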

The AdaBoost algorithm, which stands for Adaptive Boosting, is used as an ensemble method in ML. After each round, the weights are reassigned to every instance, with larger weights given to incorrectly classified instances. In supervised learning, boosting is used to reduce bias and variance. It works by growing learners sequentially: each learner after the first is built from the previously grown learners. In other words, weak learners are combined into a strong learner. Although AdaBoost operates on the same concept as boosting in general, there is a minor difference in how it works [
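The reweighting idea can be sketched with one round of the standard discrete AdaBoost update. This is a textbook formulation under the assumption that labels are encoded as -1 and +1; it is not the authors' code, and the five-instance example is invented.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels encoded as -1 / +1.

    Returns the learner's vote weight (alpha) and the normalized instance
    weights for the next round. Assumes 0 < weighted error < 1.
    """
    # Weighted error: total weight of the misclassified instances.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Correct instances (t * p = +1) shrink, wrong ones (t * p = -1) grow.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Toy example: five instances, the weak learner gets the second one wrong.
y_true = [+1, +1, -1, -1, -1]
y_pred = [+1, -1, -1, -1, -1]
start = [1.0 / 5] * 5

alpha, new_weights = adaboost_round(start, y_true, y_pred)
print(round(alpha, 4), [round(w, 3) for w in new_weights])
```

After the update the misclassified instance carries half of the total weight, which forces the next weak learner to concentrate on it.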

RF is a widely used supervised ML algorithm. It is founded on ensemble learning, a method of combining multiple classifiers to solve a complex problem and improve the model's accuracy. RF is a classification algorithm that builds several DTs on different subsets of a dataset and averages their outcomes to improve predictive accuracy. It is capable of handling large datasets with high dimensionality, which is helpful in this work since we have used a large dataset. It also improves the model's accuracy and reduces the problem of overfitting [

DT is a supervised learning strategy that applies to both classification and regression; however, it is most often used for classification problems. In this tree-structured classifier, internal nodes represent dataset attributes, branches represent decision rules, and each leaf node represents an outcome. A DT has two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have several branches, while leaf nodes are the results of those decisions and have no further branches. The decisions or tests are made based on the characteristics of the given dataset. A DT is a visual representation of all potential solutions to a problem under specific conditions: it starts from a root node, which grows by branching out and forming a tree-like layout. We used the CART (Classification and Regression Tree) algorithm to construct the tree. A DT asks a question and divides the tree into subtrees depending on the response (Yes/No) [
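To make the yes/no splitting concrete, the sketch below scores candidate questions of the form "is AGE_YRS <= t?" with the Gini impurity, the criterion commonly used by CART for classification. The five toy rows are invented for illustration and are not VAERS records; whether the authors used Gini or another criterion is not stated in the text.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(rows, feature, threshold, target="DIED"):
    """Weighted impurity after asking 'row[feature] <= threshold?'."""
    yes = [r[target] for r in rows if r[feature] <= threshold]
    no = [r[target] for r in rows if r[feature] > threshold]
    n = len(rows)
    return len(yes) / n * gini(yes) + len(no) / n * gini(no)


def best_threshold(rows, feature, target="DIED"):
    """Try every observed value (except the largest) as a split point."""
    candidates = sorted({r[feature] for r in rows})[:-1]
    return min(candidates, key=lambda t: split_gini(rows, feature, t, target))


# Invented toy rows, for illustration only.
toy_rows = [
    {"AGE_YRS": 25, "DIED": 0},
    {"AGE_YRS": 40, "DIED": 0},
    {"AGE_YRS": 55, "DIED": 0},
    {"AGE_YRS": 70, "DIED": 1},
    {"AGE_YRS": 85, "DIED": 1},
]

chosen = best_threshold(toy_rows, "AGE_YRS")
print(chosen, split_gini(toy_rows, "AGE_YRS", chosen))
```

A full tree repeats this search recursively on each resulting subset until the nodes are pure or a stopping rule is reached.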

The detailed working of

Data extraction is the first step, in which we fetch the dataset from VAERS. The data consists of multiple CSV files, so some analysis is needed to identify the essential attributes. In data pre-processing, we build the final dataset with all the required features by handling empty and null values and converting string values to appropriate integer values for the next step. In data visualization, we plot the data to make its general statistics easy to understand, e.g., the dependencies between columns such as demographics and vaccines. Feature extraction reduces the attributes to exactly the number needed. In the algorithm selection stage, we chose the LR, AdaBoost, RF, and DT algorithms for our ML analysis. The data is then split into training and testing sets in the ratio 85% (training) to 15% (testing). The four ML algorithms are applied to the processed VAERS dataset for training and testing. After a model has been trained on one set of data with a selected algorithm, it is tested on another set of data to measure its accuracy. After this training and testing with the different algorithms, the model that gives the most accurate result can be identified.

For the current work we have considered only the following 17 attributes (15 predictors plus the VAERS_ID key and the DIED target): VAERS_ID, STATE, AGE_YRS, SEX, DIED, L_THREAT, RECOVD, VAX_MANU, VAX_DOSE_SERIES, VAX_ROUTE, VAX_SITE, VAX_NAME, Symptom_1, Symptom_2, Symptom_3, Symptom_4, Symptom_5. In the data pre-processing steps, we first merged the worksheets CANDIDATE DATA, VAXINE DATA, and SYMPTOMS DATA using the common VAERS_ID field. Next, we removed the unused data fields. Then we removed duplicate rows from the dataset. The next steps are as follows:

VAXINE DATA: some candidates received more than one vaccine. We use only the vaccine data for the first vaccination received by the candidate (second doses are not considered in the analysis).

SYMPTOMS DATA: some candidates reported more symptoms than fit in one row, since each row is limited to 5 symptoms. We consider only the first 5 symptoms of each candidate.

Filling or removal of undefined values (na, null, nil, etc.): a row is removed entirely if even one attribute that is used is empty.

VAERS receives reports for all vaccines, not only COVID19 vaccines. Therefore, vaccine types other than COVID19 are removed.
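The pre-processing steps above can be sketched roughly as follows, using plain Python dictionaries in place of the CSV worksheets. This is a minimal sketch, not the authors' pipeline: the `VAX_TYPE` column name, the order in which the filters are applied, and all sample values are assumptions made for illustration. Missing symptoms are filled with 0, following the mapping of "no symptom" to 0 that the paper describes for its symptom processing.

```python
SYMPTOM_COLS = ["Symptom_1", "Symptom_2", "Symptom_3", "Symptom_4", "Symptom_5"]


def is_missing(value):
    """Treat None, empty strings, and na/null/nil markers as undefined."""
    return value is None or str(value).strip().lower() in {"", "na", "null", "nil"}


def first_by_id(records):
    """Keep only the first record seen for each VAERS_ID."""
    first = {}
    for rec in records:
        first.setdefault(rec["VAERS_ID"], rec)
    return first


def preprocess(candidates, vaccines, symptoms):
    # Assumption: a VAX_TYPE column marks COVID19 rows; filter before merging.
    vax = first_by_id(v for v in vaccines if v["VAX_TYPE"] == "COVID19")
    sym = first_by_id(symptoms)  # first symptom row = first five symptoms
    merged = []
    for cand in candidates:
        vid = cand["VAERS_ID"]
        if vid not in vax or vid not in sym:
            continue
        row = {**cand, **vax[vid], **sym[vid]}
        for col in SYMPTOM_COLS:  # "no symptom" is encoded as 0
            if is_missing(row.get(col)):
                row[col] = 0
        if any(is_missing(v) for v in row.values()):
            continue  # drop rows with any other undefined attribute
        merged.append(row)
    return merged


# Invented sample records, for illustration only.
candidates = [
    {"VAERS_ID": 1, "AGE_YRS": 66, "SEX": "F", "DIED": "N"},
    {"VAERS_ID": 2, "AGE_YRS": None, "SEX": "M", "DIED": "N"},
    {"VAERS_ID": 3, "AGE_YRS": 45, "SEX": "F", "DIED": "N"},
]
vaccines = [
    {"VAERS_ID": 1, "VAX_TYPE": "COVID19", "VAX_MANU": "MODERNA"},
    {"VAERS_ID": 1, "VAX_TYPE": "COVID19", "VAX_MANU": "PFIZER\\BIONTECH"},
    {"VAERS_ID": 2, "VAX_TYPE": "COVID19", "VAX_MANU": "MODERNA"},
    {"VAERS_ID": 3, "VAX_TYPE": "OTHER", "VAX_MANU": "UNKNOWN"},
]
symptoms = [
    {"VAERS_ID": 1, "Symptom_1": "Headache", "Symptom_2": None},
    {"VAERS_ID": 1, "Symptom_1": "Fever", "Symptom_2": "Chills"},
    {"VAERS_ID": 2, "Symptom_1": "Fatigue"},
    {"VAERS_ID": 3, "Symptom_1": "Rash"},
]

clean_rows = preprocess(candidates, vaccines, symptoms)
print(clean_rows)
```

In this toy run only candidate 1 survives: candidate 2 has an undefined age and candidate 3 has no COVID19 vaccine record.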

Initially, the dataset had 9287 instances. After the data pre-processing steps, the data size was reduced to 4717 instances, of which 4009 rows were used for training the models and 708 rows for testing them.
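As a quick check of the 85:15 partition, the sketch below shuffles row indices and splits them, taking the row count as the sum of the reported training and testing rows (4009 + 708). Rounding the test size up and the fixed seed are assumptions, since the text does not say how the split was implemented.

```python
import math
import random


def split_indices(n_rows, test_fraction=0.15, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed: reproducible shuffle
    n_test = math.ceil(n_rows * test_fraction)  # assumption: round test size up
    return indices[n_test:], indices[:n_test]


n_rows = 4009 + 708  # training rows + testing rows reported in the text
train_idx, test_idx = split_indices(n_rows)
print(len(train_idx), len(test_idx))
```

Under these assumptions the split yields 4009 training rows and 708 testing rows, matching the counts reported above.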

For this COVID-19 vaccine analysis system, we used the VAERS dataset. VAERS data can be accessed using the CDC WONDER online search tool or by downloading the raw data as CSV files for import into a database or text editing software. De-identified VAERS data becomes available 4-6 weeks after a report is submitted. Since VAERS data changes as new reports arrive, the results can differ if the exact same search is rerun later. There are a total of 4716 rows with 18 columns of data in this dataset [

We have used four different ML algorithms to analyze the VAERS data: the DT classifier, LR, AdaBoost, and RF. LR is a supervised classification algorithm that estimates the probability of a target variable. The target or dependent variable is binary, meaning that there are only two possible groups. In this case, the target variable is eligibility: it tells us whether the given candidate is eligible for the vaccination or not. RF is a supervised learning technique based on ensemble learning, a method of combining multiple classifiers, which improves the system's accuracy and helps solve a complex problem. The RF used in this model is a classifier that builds several DTs on different subsets of the dataset and then averages their results to improve predictive accuracy. DT learning, a predictive modeling method used in data mining, analytics, and ML, is also part of this model. It uses a decision tree as a predictive model to go from observations about an item (the branches) to a conclusion about the item's target value (the leaves). The data visualizations are given below.

According to the dataset, females and males were vaccinated in a ratio of about 3:1, i.e., considerably more females than males were vaccinated, as given in

From DIED Attribute (N represents Not Died, Y represents Died).

The above histogram plot shows us a clearer picture of the number of people in various age groups reported with adverse effects from COVID-19 vaccination.

According to

This count plot in

From

The above box plot explains the detailed distribution summary of data based on number summaries. From

In

The markings on the x-axis are the positions on the body at which the candidate received the vaccination; for example, LA means Left Arm, and RA means Right Arm in

We can see that class 0 (not died) spans a wide range of age groups, whereas class 1 (died) is concentrated in the age groups 60 and above, which are the ones facing a severe life threat, i.e., death, from the adverse effects of vaccination, from

Note that the MODERNA vaccine was given to more people; therefore, the number of people adversely affected by the MODERNA vaccine is also higher.

According to

The orange color signifies that the values are close to 0, while the green and dark orange indicate that the correlation between variables is close to +1 or −1 in

The next step after implementing ML algorithms is to find out how effective the model is based on metrics and datasets. Different performance metrics are used to evaluate different ML Algorithms. The various evaluation factors used here are true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Here we are using evaluators as follows:

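Accuracy: the fraction of all classifications that were correct:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Precision: the ratio of true positives to all predicted positives:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: the ratio of true positives to all actual positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

F1-Score: the harmonic mean of precision and recall, to which precision and recall contribute equally:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$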

Sl. No. | Approach | Precision | Recall | F1-score | Accuracy |
---|---|---|---|---|---|
1 | LR | 0.885417 | 0.904255 | 0.894737 | 0.973175 |
2 | RF | 0.891089 | 0.957447 | 0.923077 | 0.978814 |
3 | AdaBoost | 0.893204 | 0.978723 | 0.934010 | 0.981638 |
4 | DT | 0.911111 | 0.872340 | 0.891304 | 0.973163 |
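The four metrics can be recomputed from confusion-matrix counts, as in the sketch below. The paper does not report the confusion matrices; the counts used here (TP = 92, TN = 603, FP = 11, FN = 2) are an inference, chosen because they reproduce the AdaBoost row of the table for a 708-row test set, and should be read as an assumption rather than a reported result.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Inferred counts (NOT reported in the paper): they sum to 708 test rows
# and are consistent with the AdaBoost row of the results table.
adaboost = classification_metrics(tp=92, tn=603, fp=11, fn=2)
for name, value in adaboost.items():
    print(f"{name}: {value:.6f}")
```

The same function applies to the other three models once their confusion-matrix counts are known.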

We have set the target variable as 'DIED' (value 0 means alive and 1 means died). The values of the following variables influence the value of the target variable: STATE, AGE_YRS, SEX, L_THREAT, RECOVD, VAX_MANU, VAX_DOSE_SERIES, VAX_ROUTE, VAX_SITE, VAX_NAME, Symptom_1, Symptom_2, Symptom_3, Symptom_4, Symptom_5. To process a new candidate, we have to input all the variable values, including the symptoms. The problem, however, is that the symptoms in the dataset are recorded after vaccination, and a new candidate has no symptoms before vaccination. In our processing of symptoms, we mapped 'no symptom' to the value '0', so a new candidate can be entered with '0' for every symptom (i.e., Symptom_1 to Symptom_5), but the outputs for such inputs have not been tested to establish whether the possibility of death can be predicted for that person.

The ROC curve below shows the trade-off between sensitivity and specificity for every possible cut-off or threshold.

From

Prevention is better than cure. This undoubtedly applies to the current COVID-19 scenario, even before vaccination is considered. It is better to analyze the previous data and determine which factors influence the outcome when a vaccine is administered. Vaccination can then be avoided for the particular demographic group at risk (ages or genders in a particular area).

This paper is based on experiments with the dataset provided by VAERS, which contains data on individuals living in the United States of America. Other countries could benefit by applying similar techniques to understand and analyze the effects of vaccination. The paper focuses on the current scenario of COVID-19 vaccination. We used the ML algorithms AdaBoost, DT, RF, and LR to analyze the dataset and to show which demographic groups were most affected after receiving the MODERNA and PFIZER\BIONTECH vaccines. The work can be extended to datasets for other vaccines and diseases. For ease of implementation, we restricted the number of symptoms to 5. This can be taken a step further by automating the system so that predictions are based on more than five symptoms. Whenever new data enters the dataset, the automated system could produce new predictions based on the factors at that moment.