In the field of computer research, the volume of data generated by societal progress has grown remarkably, and the management of this data and the analysis of related businesses have become popular research topics. The ability to extract key characteristics from second-hand housing data and use them to forecast home prices has numerous practical applications. Applying machine-learning regression methods to segment the data set, examine the major factors affecting it, and forecast home prices is the most popular approach to analyzing pricing information. Precise forecasts are challenging to produce because many of the regression models currently used in research cannot efficiently capture the distinctive factors that correlate strongly with house price movement. Ensemble learning is a prevalent and well-regarded methodology in today's forecasting studies, but the regression-integration computation on large housing datasets consumes substantial computing resources and computation time, and integrating diverse models demands still more resources and machine support. The Average Model proposed in this paper uses the concept of fusion to produce integrated analysis results from several models, combining the best strengths of the individual models. The Average Model applies broadly to regression prediction and significantly increases computational efficiency; the technique is also easy to replicate and very effective in regression studies. Before applying regression processing techniques, this work averages different regression models with the AM (Average Model) algorithm in a novel way. By combining essential models that each reach about 90% accuracy, this technique significantly increases the accuracy of house price predictions.
The experimental results show that the AM algorithm proposed in this paper achieves lower prediction error and substantially higher prediction accuracy than the comparison algorithms, and performs well in house price prediction experiments.
The real estate sector is one of China's pillar industries. By making rational use of abundant real estate information, people can make wise home-buying decisions and the burden of government regulation can be reduced. Data analysis is developing in tandem with machine learning. Second-hand housing data can be scientifically processed by the regression models the algorithm provides and converted into a data format that can be properly evaluated. Building on the evaluation of several regression methods, the AM algorithm continuously improves its statistical capability and achieves significant gains in data processing and prediction.
In realistic prediction analysis, a model that is too simple fits poorly and underfits, while a model that is too complex fits the existing data well but overfits and degrades the prediction results. The goal of model averaging is to choose several regression models that perform well from a variety of perspectives and then assign each model a certain weight; the weight computation is central to the model averaging approach. Model averaging is a weighted combination that extends model selection: it processes the data more thoroughly than model selection alone and makes it difficult to overlook any viable model. As a result, combining several regression algorithms has evolved into a highly useful technique for estimating and researching the cost of used homes. Regression models can exceed 80% prediction accuracy, which is very helpful for the real estate sector and supports practical decision-making.
Both domestic and international scholars employ a variety of techniques to study and forecast property prices. Statistical analyses of house price data fall into two general categories: one uses time series data, the other panel data (Wang, 2006) [
The estimation function of each regression model is established first. After the functions are obtained, the test data is fed into these models, which process it scientifically.
To establish the linear regression model, we preprocess the acquired data, train the model with the relevant algorithms, and then apply the model. Curve fitting is used to analyze the distribution of the data set; the fitted curves turn out to be straight lines, so linear regression is appropriate. In addition, a hypothesis function must be set for the linear regression; the hypothesis function adopted in this paper is Y = aX + b.
Here, Y is the model's prediction (the predicted house price), as distinct from the real house price; X represents the characteristic factors affecting house price changes; and a and b are the parameters of the model.
From the preceding analysis of the preprocessed data, the main factor influencing housing price is the housing-area information. Data normalization is then performed so that the differing value ranges of the features do not cause the computed floating-point values to blow up or vanish. After observing the distribution characteristics of the data, the normalization is completed: the mean is subtracted and the result is divided by the original value range, scaling the attributes of the same dimension into a narrower range.
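The mean-removal and range-scaling step described above can be sketched as follows; the feature values here are hypothetical, not the paper's data.

```python
import numpy as np

def normalize(features):
    """Scale each column by removing its mean and dividing by its
    value range (max - min), as described in the text."""
    mean = features.mean(axis=0)
    value_range = features.max(axis=0) - features.min(axis=0)
    return (features - mean) / value_range

# Toy matrix of housing area (m^2) and room count (illustrative values).
X = np.array([[60.0, 1.0], [90.0, 2.0], [120.0, 3.0]])
X_norm = normalize(X)
```

After this transformation, each column has zero mean and spans a unit range, so no single feature dominates the gradient computation.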
Once the loss function is defined, the Fluid interface is called to compute the variance and obtain its average; the input is defined as the predicted house price, the label data is defined, and the loss value is then calculated. The loss function is optimized by gradient descent [
After the model is initialized, the network structure is configured, the training function is constructed and optimized, and the model is trained. Before the Fluid interface is used to train and test the model, the executors are defined. Once built, the trainer can supply a specific optimization method, the training hardware location, and the related network architecture.
A decision tree is a type of prediction algorithm that, for a specific scenario, can evaluate the net present value, the probability that the expected value exceeds zero, and the risk of the entire project experiment [
A Decision Tree encodes a great deal of information: each internal node corresponds to a test on a specific attribute, each leaf node corresponds to a class or grouping, and each branch represents a test outcome.
Training a Decision Tree requires collecting varied data and using a particular algorithm to classify the training samples according to specific attributes. The classifier produced by the machine-learning algorithm can then assign accurate final classifications to newly encountered objects [
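As a minimal sketch of the training process just described, a decision-tree regressor can be fit to synthetic area/price data with scikit-learn (the paper does not state its exact library calls, and these values are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic example: house area (m^2) -> price, purely illustrative.
rng = np.random.default_rng(0)
area = rng.uniform(40, 150, size=(200, 1))
price = 0.8 * area[:, 0] + rng.normal(0, 2, size=200)

tree = DecisionTreeRegressor(max_depth=4)  # shallow tree to limit overfitting
tree.fit(area, price)
r2 = tree.score(area, price)  # coefficient of determination on training data
```

The fitted tree can then classify (here, predict a price for) newly encountered listings via `tree.predict`.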
To obtain "strong" integrated information, the Random Forest integrates additional learning models based on decision trees, continually combining the "weak" learners throughout the entire experiment [
After training is complete we obtain k trees; each new tree is fit to the residual error of the previous complete forecast, and every sample point is predicted according to its relevant characteristics. The XGBoost algorithm builds each tree by repeatedly splitting on these features [
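The residual-fitting idea behind XGBoost can be illustrated with a hand-rolled boosting loop of shallow scikit-learn trees; this is a sketch of the principle, not the XGBoost library itself, and the data is synthetic.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

# Boosting sketch: each new tree is fit to the current residual,
# i.e. the error left by the sum of all previous trees.
prediction = np.zeros_like(y)
trees, learning_rate = [], 0.5
for _ in range(50):
    residual = y - prediction
    weak_tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * weak_tree.predict(X)
    trees.append(weak_tree)

mse = np.mean((y - prediction) ** 2)
```

Each round the residual shrinks, so the summed ensemble steadily approaches the target signal.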
The XGBRegressor() function is then invoked directly for the model accuracy test to retrieve the accuracy value [
In the process of constructing a neural network, the computational output is described in a way similar to “class and object” [
All input features and the final predicted output are represented as vectors. The input feature x contains 16 components, and the label y has a single component. The neural network is then constructed: passing the input features through the model yields a prediction z that captures the specific factors influencing price, while y denotes the actual price observed in the experiment. To measure the gap between the predicted value z and the real value y, a specific indicator is needed; this paper adopts the mean square error as the accuracy metric of the model [
In the formula, "Loss" denotes the loss function, which serves as the measurement index. The loss over the samples is then computed: the Network class interface is used to calculate the predicted value and obtain the loss function, and the dimension information, including the feature vector and the number of samples, is defined [
By iteratively computing the gradient, the loss function is driven smaller, and the parameter solutions w and b that minimize the loss function can be found.
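The gradient-descent search for w and b can be sketched in NumPy for a linear model z = Xw + b with 16 input features, matching the text; the data and learning rate are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 16))           # 16 input features, as in the text
true_w = rng.normal(size=16)
y = X @ true_w + 3.0                     # synthetic prices with true bias b = 3

w, b, lr = np.zeros(16), 0.0, 0.1
for _ in range(500):
    z = X @ w + b                        # forward pass: predicted price
    grad_w = 2 * X.T @ (z - y) / len(y)  # dLoss/dw for mean square error
    grad_b = 2 * np.mean(z - y)          # dLoss/db
    w -= lr * grad_w                     # step against the gradient
    b -= lr * grad_b

loss = np.mean((X @ w + b - y) ** 2)
```

Each iteration moves (w, b) against the gradient of the MSE loss, so the loss decreases toward its minimum.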
A single-model approach may offer researchers the best model, but the one model selected can have flaws such as unreliability, missing important information, high risk, and target deviation. The model averaging method was created to address these drawbacks. It is an extension of model selection that moves smoothly from estimation to prediction: the candidate models are combined according to their weights, which avoids the "all eggs in one basket" risk of relying on a single model. How to distribute the weights is the most crucial issue [
The model averaging method was first proposed by Buckland in 1997; its core is a score based on information criteria. Many scholars have since studied this problem in more depth. Some proposed a hybrid adaptive regression method (ARM) for combining estimators of a regression function based on the same data; others proposed the ARMS model averaging method and verified its effectiveness by comparison with the EBMA method. Hansen (2007) proposed the Mallows criterion for selecting weights: the combined model is given appropriate weights by minimizing the Mallows criterion, which is proved asymptotically optimal in the sense of minimizing the squared error for independent data. In 2008 the Mallows model averaging (MMA) method was applied to time series data, further verifying the accuracy and effectiveness of the Mallows criterion [
Based on the analysis of the relevant regression models, this paper proposes the AM algorithm model. The algorithm averages the obtained regression models and then assigns a weight to each averaged model; the comprehensiveness of the method ensures that the better models are not overfit and the sub-optimal models are not over-suppressed, yielding a more scientific house price prediction model [
Once the data has been preprocessed, the optimal data properties are examined first. The regression model is then built and tested against the training data, and the prediction results are produced.
The FMA method is used in the study of model averaging; its core is correctly defining the combination weights. Typical examples are the smoothed AIC (S-AIC) and smoothed BIC (S-BIC) methods introduced by Buckland (1997) and others. The final model is obtained by giving appropriate weights to multiple candidate models. The combination weight is calculated as

w_{k} = exp(−I_{k}/2) / ∑_{j=1}^{K} exp(−I_{j}/2),

in which I_{k} denotes the information-criterion score (AIC or BIC) of the k-th candidate model and K is the number of candidate models.
The model averaging method analyzes the problem more thoroughly and does not readily rule out any viable model. Unfortunately, the amount of calculation it requires rises significantly in multivariable situations: with n variables, all of the specific candidate models chosen in the experiment must be averaged. For models with fewer variables, it can assess data and make predictions accurately [
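The smoothed-AIC weighting described above can be sketched as follows; the AIC scores and per-model predictions here are hypothetical, for illustration only.

```python
import numpy as np

def s_aic_weights(aic_scores):
    """Smoothed-AIC combination weights: w_k proportional to
    exp(-AIC_k / 2), normalized to sum to one (Buckland, 1997)."""
    scores = np.asarray(aic_scores, dtype=float)
    scores -= scores.min()          # shift for numerical stability
    raw = np.exp(-scores / 2.0)
    return raw / raw.sum()

# Hypothetical AIC scores for three candidate models.
weights = s_aic_weights([102.3, 100.1, 105.7])

# Weighted average of the three models' predictions for one sample.
averaged = weights @ np.array([1.2, 1.0, 1.5])
```

The model with the lowest AIC receives the largest weight, but no candidate is discarded outright.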
The experiments were run on Ubuntu 18.04 with an Intel Xeon(R) CPU E5-2609 v4 @ 1.70 GHz, an Nvidia 1080Ti GPU with 8 GB of memory, and Python 3.6.4 as the development language.
The datasets are based on second-hand housing data for major Chinese cities supplied by Baidu Open AI. The raw data contains a great deal of incorrect information; after detailed analysis, the information is classified by characteristics. The data used by the model algorithm is split into a train set and a test set in proportions of 70% and 30%, respectively.
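The 70%/30% split can be reproduced with scikit-learn's splitter; the feature matrix below is a placeholder, not the actual housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # placeholder feature matrix
y = np.arange(100, dtype=float)     # placeholder prices

# 70% train / 30% test, as used in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
```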
The characteristic values of the gathered datasets are extracted with machine-learning correlation techniques, and the correlations between the characteristic values are calculated. The price curve is observed, and the correlation feature image is used to examine how one characteristic affects another, to process and evaluate the price, and ultimately to arrive at the prediction result.
Garbled and anomalous entries are handled with missing-value processing, and the K-means algorithm is used to fill in anomalous values. Binary (two-valued) encoding is used for features such as elevator, subway, decoration, and tax status, and feature information unrelated to the regression task is removed immediately. Additionally, after the gaps are filled, the data is divided into variables of various types, the continuous variables are expanded, and the categorical variables are one-hot encoded [
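A minimal pandas sketch of this preprocessing step follows; the column names and values are illustrative assumptions, not the paper's actual schema, and the mean-fill stands in for the K-means imputation described above.

```python
import pandas as pd

# Toy listing table; columns are illustrative, not the paper's schema.
df = pd.DataFrame({
    "area": [60.0, 90.0, None, 120.0],
    "elevator": ["yes", "no", "yes", None],
    "district": ["A", "B", "A", "C"],
})

df["area"] = df["area"].fillna(df["area"].mean())       # fill missing values
df["elevator"] = (df["elevator"] == "yes").astype(int)  # binary (0/1) encoding
df = pd.get_dummies(df, columns=["district"])           # one-hot encoding
```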
A number of datasets are used in the research process for classification processing and correlation analysis to determine the factors that influence house price. The feature with the highest correlation is then chosen to analyze house price changes, and various regression models are used to predict on the data set. The most innovative aspects of this research are its analysis of how different regression models process price data sets, its examination of how various characteristics influence house prices, its evaluation of each model's predictive ability, and its subsequent grid-parameter tuning to improve that ability.
Fill in the missing values. For example, there was no subway in the earlier years of the data set, so the blanks in the Subway feature column must be filled with "no subway". The data is then visualized and relevant predictions are made according to these specific operations.
Specific steps to achieve the prediction
In
The decision tree is used to train the influence of housing area on price factors. In
Random Forest reduces variance on top of decision trees by training multiple sampled trees. However, the model's predictions have limits: sparse points do not obtain good prediction results. For example, in
Random Forest and Decision Tree are the same types of regression algorithms.
All features except the price feature to be predicted are used as influencing factors of house price to evaluate the model, which maximizes its prediction effect. The evaluation trains the model on multidimensional data as the basis for judging its predictive performance.
The continuous variables are expanded, the categorical variables are encoded, and the inputs to the algorithm are standardized. For example, see
| Model | Unstandard_R^{2} | Standard_R^{2} | Standardize_Data_R^{2} |
| --- | --- | --- | --- |
| LinearRegression | 0.854 | 0.885 | 0.832 |
| SVR | 0.024 | 0.882 | 0.882 |
| Lasso | 0.885 | −0.001 | 0.854 |
| DecisionTree | 0.849 | 0.855 | 0.853 |
| ExtraTree | 0.713 | 0.700 | 0.725 |
| RandomForest | 0.879 | 0.878 | 0.878 |
| AdaBoost | 0.808 | 0.804 | 0.797 |
| Bagging | 0.865 | 0.874 | 0.873 |
R^{2} is a statistical measure giving the proportion of the variance of the dependent variable explained by the independent variables in the regression model. Correlation shows the strength of the relationship between the independent and dependent variables, while R^{2} shows the extent to which the variance of one variable explains the variance of the other [
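The R^{2} definition above can be computed directly from its standard formula, 1 − SS_res/SS_tot; this sketch uses toy values, not the paper's data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # perfect fit -> 1.0
```

A perfect fit yields 1, while always predicting the mean yields 0.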
The loss function measures, in formula form, the difference between the prediction results and the real data; a smaller difference means better predictions. For linear models, the commonly used loss function is the mean square error: MSE = (1/n) ∑_{i=1}^{n} (ŷ_{i} − y_{i})^{2}.
For a test set of n samples, MSE is the mean of the squared errors of the n prediction results.
MAE is the average of the absolute errors between the predicted values and the real values: MAE = (1/n) ∑_{i=1}^{n} |ŷ_{i} − y_{i}|, where ŷ_{i} is the predicted value and y_{i} the real value.
RMSE, the root mean square error (root mean square deviation), is the sample standard deviation of the differences between predicted and observed values: RMSE = √((1/n) ∑_{i=1}^{n} (ŷ_{i} − y_{i})^{2}).
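The three error metrics can be sketched together in NumPy; the sample values are illustrative, chosen so the results are easy to verify by hand.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean square error."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root mean square error: the square root of MSE."""
    return np.sqrt(mse(y_true, y_pred))

# Errors are [2, 0], so MSE = 2.0, MAE = 1.0, RMSE = sqrt(2).
y_true, y_pred = [3.0, 5.0], [1.0, 5.0]
```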
This paper mainly uses MSE and R^{2} as evaluation indices of the model's prediction effect. MSE measures estimator quality: it is always non-negative, and values closer to zero are better, making it an appropriate and effective evaluation index. Regression algorithms such as Random Forest and Decision Tree can also reach accuracies above 80% after parameter tuning [
The mean square error, root mean square error, mean absolute error, and R^{2} value are used as evaluation functions. The smaller the mean error, the better the model's prediction effect; the larger the R^{2} value, the more efficient the model. It can be seen from
| Model | MSE | MAE | RMSE |
| --- | --- | --- | --- |
| LinearRegression | 0.518 | 0.282 | 0.398 |
| SVR | 0.143 | 0.224 | 0.378 |
| Lasso | 1.060 | 0.804 | 0.030 |
| DecisionTree | 0.170 | 0.275 | 0.412 |
| ExtraTree | 0.313 | 0.385 | 0.560 |
| RandomForest | 0.142 | 0.255 | 0.377 |
| AdaBoost | 0.205 | 0.329 | 0.453 |
| Bagging | 0.144 | 0.274 | 0.390 |
Comparing the loss functions shows that on both MSE and MAE the AM algorithm achieves the best effect and the optimal data coupling, avoiding the overfitting of Gradient Boosting and the underfitting of Linear Regression.
Define the calculation method of the loss function. The KFold [
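A K-fold cross-validated RMSE score of the kind used to rank candidate models can be sketched with scikit-learn's KFold; the data and fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(0, 0.1, size=100)

# 5-fold cross-validation: train on 4 folds, score RMSE on the held-out fold.
rmse_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    err = y[test_idx] - model.predict(X[test_idx])
    rmse_scores.append(np.sqrt(np.mean(err ** 2)))

rmse_mean, rmse_std = np.mean(rmse_scores), np.std(rmse_scores)
```

The mean and standard deviation of the fold scores correspond to the RMSE_Mean and RMSE_Std columns reported below.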
After evaluating all the regression models, this paper proposes a new algorithm, the AM algorithm, which internally optimizes the models, stacks their predicted results, and then calculates the average value to obtain a better optimization effect.
By averaging the established regression models, the model with the highest coupling is obtained: the AM algorithm model.
| Model | RMSE_Mean | RMSE_Std |
| --- | --- | --- |
| Lasso | 0.355 | 0.027 |
| Linear regression | 0.355 | 0.028 |
| Decision tree | 0.371 | 0.056 |
| SVR | 0.288 | 0.061 |
| Random forest | 0.308 | 0.053 |
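The stack-and-average step of the AM algorithm can be sketched as follows; the candidate models, data, and equal (unweighted) averaging here are illustrative simplifications of the paper's method.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=200)

# AM-style averaging sketch: fit several regressors, stack their
# predictions column-wise, and take the mean as the final estimate.
models = [LinearRegression(), Lasso(alpha=0.01),
          DecisionTreeRegressor(max_depth=6)]
stacked = np.column_stack([m.fit(X, y).predict(X) for m in models])
am_prediction = stacked.mean(axis=1)

am_mse = np.mean((y - am_prediction) ** 2)
```

Averaging smooths out the individual models' errors, so no single model's weakness dominates the final prediction.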
According to the examination of the final prediction effect and error function, the AM algorithm has the strongest integration ability and the best impact on price prediction. Regardless of the implementation mode, parameter use, or data processing, the AM method has a smoother processing mode and a better fitting effect on standard data.
In the experiments on the second-hand housing data sets, the AM algorithm can process the model with the minimum gradient according to the information in the data set, and it uses the averaging approach to obtain good prediction results across the various regression models. Additionally, the AM approach enhances the coupling strength of the model and adapts well to the error-handling techniques used in the different regression models. The technique has a wide range of applications and does not readily discard any candidate model, producing more complete results. This work combines only the model averaging approach and the principal component analysis method; presumably, many traditional methods could be combined with the AM algorithm, possibly achieving better results in analyzing and predicting practical problems.
To assess and forecast the second-hand house sales price index, the study combines the time series model with model selection and the model averaging approach. The AM algorithm is established; the autocorrelation and partial autocorrelation functions of the data are observed; many candidate models are obtained; the MSE and R^{2} values of each model are calculated and compared; the better models are chosen; and a new model is established using the AM method. This new model is then used to predict the second-hand housing sales price index, and a comparison of prediction errors shows the AM algorithm to be the most accurate. In addition, the AM algorithm proposed in this paper contributes broadly to regression prediction in machine learning: it can be applied flexibly to regression analysis of stock development trends, weather index trends, and so on. Moreover, the algorithm is efficient to implement, occupies few resources, and is simpler and more effective than comparable ensemble learning algorithms.
This work was supported in part by Sichuan Science and Technology Program (Grant No. 2022YFG0174) and in part by the Sichuan Gas Turbine Research Institute stability support project of China Aero Engine Group Co., Ltd (Grant No. GJCZ-0034-19). (Corresponding author: Yong Zhou).
The authors declare that they have no conflicts of interest to report regarding the present study.