Predicting the salary trends of students after employment is vital for helping students develop their career plans. In particular, salary is not only key employment information for students pursuing jobs, but also an important indicator of graduates’ employability and competitiveness. This paper treats salary prediction as an ordinal regression problem and uses deep learning techniques to build a salary prediction model that determines the relative ordering between different salary grades. Specifically, the model uses students’ personal information, grades, and family data as input features and employs a multi-output deep neural network to capture the correlation between salary grades during training. Moreover, the model is pre-trained using a stacked denoising autoencoder, and the resulting weights are used as the initial weights of the neural network. To improve the performance of the model, dropout and bootstrap aggregation are applied. The experimental results are very encouraging. With the predicted salary grades of graduates, institutional researchers can gain a clear understanding of salary trends and thereby promote student employability and competitiveness.

In recent years, research on graduate employment trends has been one of the main focuses of institutional researchers. In 2015, the Statistics Bureau at the Ministry of Education in Taiwan [

The rest of this paper is organized as follows. Section 2 provides a concise review of the literature on ordinal regression, stacked de-noising auto-encoders, bootstrap aggregating, and related educational data mining cases. Section 3 introduces the research design and describes the data processing and data analysis methods; the system design and implementation are also described. Section 4 presents and discusses the empirical results. Section 5, the final section, concludes the paper by listing the contributions and limitations of this study.

In this section, we first review the methods used for data preprocessing and the related literature. These methods can be divided into ordinal regression problem settings, stacked de-noising auto-encoders, and the Bootstrap Aggregating ensemble learning algorithm. In the second part of this section, we review the concepts and theoretical models of related educational data mining cases.

Ordinal regression is a supervised learning problem that aims to classify instances into ordinal categories [

Niu et al. [

Chang et al. [

Srivastava et al. [

When the data contain noise, the model is prone to overfitting; moreover, the impact of noise on the model is more severe when the amount of data is small. Therefore, Vincent et al. [

Vincent et al. [

Bootstrap Aggregating is a kind of ensemble learning algorithm that can be used with classification or regression models. Its random sampling with replacement effectively reduces the influence of noise and outliers in the dataset, which helps the model avoid overfitting.

In a study, reference [

The study [

The salary grading prediction model design process is divided into three stages: data preparation, construction of the deep network model, and prediction and evaluation. The details are shown in

College and master’s students who graduated in the 2011–2015 academic year from the National Taipei University of Technology (NTUT) comprise the data source for this study. The input features are given in

This study excluded unemployed, uninsured, and part-time employees from the data. We classified those who were insured for fewer than 30 days and those with a salary grade below 20,100 as part-time employees. Due to data size constraints, this study merged the salary grades into nine categories.

Group | Feature name
---|---
Grades at School | Department Rank,
Personal Information | Educational System, College, Department, Birthplace, Admission Method, Gender, Preferential Treatment to Enter School, Enter Year, Enter Semester, Graduation Year, Graduation Semester, Extended, Suspended, Loan
Parents’ Information | Father’s Education, Mother’s Education

Data quality has a significant influence on model predictions [

Feature Scaling

Different value ranges of data may lead to slow convergence and degraded predictive performance during model training. In this study, the numerical features were normalized (Min-Max Normalization) so that each input feature fell in the interval [0, 1]. The formula is as follows: x' = (x − x_min) / (x_max − x_min), where x_min and x_max are the minimum and maximum values of the feature.
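As a concrete illustration, min-max normalization can be sketched in a few lines of Python (the function name is ours, not from the paper):

```python
def min_max_scale(values):
    """Scale numeric feature values into [0, 1] via min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:                         # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

In a real pipeline, the minimum and maximum would be computed on the training set only and reused for the validation and test sets, so no information leaks across splits.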

Data Encoding

Non-numerical data must be encoded into numerical form before it can be input into the neural network model. This study performed One-Hot Encoding on the unordered non-numerical data and Label Encoding on the ordered non-numerical data.
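The two encodings can be sketched as follows (the helper names are ours; a production pipeline would typically use a library such as scikit-learn):

```python
def one_hot_encode(values):
    """One-hot encode unordered categories (e.g. department, birthplace)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]

def label_encode(values, order):
    """Label-encode ordered categories (e.g. parents' education level),
    where `order` lists the categories from lowest to highest."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]
```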

Dealing with Missing Data and Outliers

Missing values affect the performance of a model. Sun et al. [

This study divided the data into a training dataset, a validation dataset, and a testing dataset. The latest academic year’s data was used as the testing dataset to ensure the generalization of the model; 80% of the remaining data was used as the training dataset and 20% as the validation dataset. The counts are shown in

 | Training dataset | Validation dataset | Testing dataset
---|---|---|---
One year after graduation | 2261 | 565 | 845
Two years after graduation | 3097 | 776 | 1307
Three years after graduation | 2562 | 624 | 1623
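The split described above can be sketched as follows, assuming each record carries an academic-year field (the field name `year`, the ratio, and the seed are illustrative assumptions):

```python
import random

def split_by_year(records, test_year, train_ratio=0.8, seed=42):
    """Hold out the latest academic year as the test set, then split the
    remaining records into training and validation sets (80/20)."""
    test = [r for r in records if r["year"] == test_year]
    rest = [r for r in records if r["year"] != test_year]
    random.Random(seed).shuffle(rest)          # deterministic shuffle
    cut = int(len(rest) * train_ratio)
    return rest[:cut], rest[cut:], test
```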

This study built a neural network for each year after graduation. We used three layers of fully connected neural networks and an output layer based on the multiple-output-layer framework proposed by Niu et al. [

A traditional neural network usually uses 0 or values close to 0 as the initial weights, which makes a deep network tend to converge to a local minimum. In addition, when there is little data, noise strongly influences the model, making it prone to overfitting. This study used a Stacked De-noising Auto-encoder to pre-train the weights, which reduces the probability of the model converging to a poor local minimum and effectively reduces the impact of noise on the model. Our model pre-training process is shown in
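The de-noising step that distinguishes a de-noising auto-encoder from a plain auto-encoder is the input corruption; masking noise, one common choice from Vincent et al., can be sketched as follows (the noise rate is illustrative):

```python
import random

def corrupt(x, noise_rate=0.3, rng=random):
    """Masking noise: independently zero each input with probability
    noise_rate. Each DAE layer is trained to reconstruct the clean x
    from corrupt(x); the trained encoders are then stacked and their
    weights reused to initialize the supervised network."""
    return [0.0 if rng.random() < noise_rate else v for v in x]
```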

This study used pre-trained weights as the initial weights of the neural network. The output layer used a multiple output layer framework proposed by Niu et al. [
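In Niu et al.’s framework, ordinal regression over K grades is reduced to K−1 binary subproblems, one per grade boundary, so the outputs jointly encode the ordering. A minimal sketch of the target encoding and decoding (the helper names are ours):

```python
def ordinal_targets(grade, num_grades):
    """Encode grade g in 1..K as K-1 binary targets:
    target k is 1 iff g > k (i.e. the first g-1 targets are 1)."""
    return [1] * (grade - 1) + [0] * (num_grades - grade)

def decode_grade(outputs, threshold=0.5):
    """Decode K-1 sigmoid outputs back to a grade in 1..K by counting
    how many boundary classifiers fire."""
    return 1 + sum(1 for p in outputs if p > threshold)
```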

As shown in

When the model is too complex, it easily over-fits the training data, reducing its generalization ability. Hence, this study used L2 regularization to penalize model complexity and shrink the weights of unimportant connections. It further used Dropout to prevent neurons from co-adapting. Finally, this study used early stopping to halt training as soon as the validation loss stops decreasing. The network training algorithm is given in
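Early stopping can be sketched as a simple patience rule on the validation-loss history (the patience value is illustrative, not from the paper):

```python
def should_stop(val_losses, patience=5):
    """Stop training once the validation loss has not improved for
    `patience` consecutive epochs after its best value."""
    best_epoch = val_losses.index(min(val_losses))
    return len(val_losses) - 1 - best_epoch >= patience
```

In practice the weights from the best epoch, not the last one, are kept as the final model.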

To improve the generalization ability of the model, this study used the Bootstrap Aggregating algorithm to train multiple neural networks. As shown in
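The Bootstrap Aggregating procedure can be sketched as follows, under the assumption that each base network outputs a numeric grade and the ensemble averages and rounds the predictions (the aggregation rule is our illustration):

```python
import random

def bootstrap_sample(data, rng):
    """Draw a sample of the same size with replacement; each base model
    is trained on its own bootstrap sample."""
    return [rng.choice(data) for _ in data]

def bagged_predict(models, x):
    """Aggregate the ensemble: average the models' grade predictions
    and round to the nearest salary grade."""
    return round(sum(m(x) for m in models) / len(models))
```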

For the ordinal regression problem, the standard method for measuring performance is MAE (Mean Absolute Error) [

However, the ordinal regression problem is prone to class imbalance across the categories. Baccianella et al. [

This study used the macro-averaged MAE, denoted MAE^M. The formula is as shown in
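The macro-averaged MAE of Baccianella et al. computes the MAE separately within each true grade and then averages over the grades, so that minority grades count equally; a sketch:

```python
def macro_mae(y_true, y_pred, num_grades):
    """Macro-averaged MAE: average the per-grade mean absolute errors
    over the grades present in y_true, so each grade weighs equally
    regardless of how many samples it has."""
    per_class = []
    for g in range(1, num_grades + 1):
        errs = [abs(t - p) for t, p in zip(y_true, y_pred) if t == g]
        if errs:
            per_class.append(sum(errs) / len(errs))
    return sum(per_class) / len(per_class)
```

For example, with true grades [1, 1, 1, 2] and predictions [1, 1, 1, 4], the plain MAE is 0.5, but the macro-averaged MAE is 1.0 because the single minority-grade error is not diluted by the majority grade.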

Using the SDAE for pre-training can reduce the dimensionality of the data and the impact of noise. This experiment compared the performance of the model with and without pre-training to determine whether pre-training improves performance.

This study compared different numbers of network layers and neurons to find the numbers of hidden layers and neurons most suitable for the model. As shown in

Layer | MAE
---|---
4 Layer (1000-700-500-300) | 1.659
4 Layer (700-500-200-150) | 1.643
3 Layer (1000-700-500) | 1.542
2 Layer (1000-700) | 1.543
2 Layer (700-500) | 1.529

Optimizer | Learning Rate | MAE
---|---|---
Adagrad | 0.1 | 1.56
Adagrad | 0.03 | 1.551
Adagrad | 0.05 |
Adam | 0.01 | 1.597
Adam | 0.001 | 1.565
Adam | 0.0001 |

This study compared the predictive performance of five machine learning algorithms: SVM, SVR, Logistic Regression (LR), Random Forest (RF), and Ordinal Regression (OR). Our model uses dimensionality reduction to address the excess input dimensions and the small amount of training data; the baseline machine learning models likewise use PCA (Principal Component Analysis) for dimensionality reduction, yet our proposed model achieves a lower MAE than the other machine learning models. The results are shown in

The regression algorithm SVR has a lower MAE than our model in the first year after graduation, but it performs poorly on the data for two or three years after graduation. This is due to data imbalance: most of the data fall in the later grades, so the regression algorithm rarely predicts the first few grades.

The models for two and three years after graduation do not use the salary grades of previous years as input features, yet their predictions perform better above the 25.6 K grade. However, the data for two or three years after graduation are extremely imbalanced, and the amount of data is small, so there is not enough data in the first few grades for training. It can be seen from

 | MAE | Accuracy | Near accuracy
---|---|---|---
Logistic Regression | 1.739 | 30.45% | 57.57%
SVM | 1.778 | 31.64% | 59.57%
SVR | 1.519 | 24.60% | 61.17%
Random Forest | 1.789 | 30.58% | 56.78%
Ordinal Regression | 1.642 | 21.27% | 60.10%
Our Model | 1.529 | 25.00% | 61.43%

 | MAE | Accuracy | Near accuracy
---|---|---|---
Logistic Regression | 1.884 | 49.06% | 72.93%
SVM | 1.742 | 48.62% | 74.23%
SVR | 1.667 | 41.28% | 77.06%
Random Forest | 1.802 | 46.10% | 72.47%
Ordinal Regression | 1.775 | 39.60% | 74.15%
Our Model | 1.510 | 45.41% | 77.37%

 | MAE | Accuracy | Near accuracy
---|---|---|---
Logistic Regression | 1.777 | 52.65% | 78.97%
SVM | 1.871 | 54.00% | 79.22%
SVR | 1.804 | 45.56% | 79.34%
Random Forest | 1.807 | 51.72% | 77.92%
Ordinal Regression | 1.719 | 47.47% | 76.75%
Our Model | | |

This study used a deep neural network, which achieved a lower MAE than the machine learning methods, to predict graduates’ salary grades. The study built a salary grading prediction model via ordinal regression, which can be extended to other ordinal prediction problems, such as predicting students’ Grade Point Average (GPA). The proposed model provides relevant information on salary trends to institutional researchers, who can use this information to devise corresponding strategies that help raise students’ salaries when they are employed.