Using time-series data analysis for stock-price forecasting (SPF) is complex and challenging because many factors can influence stock prices (e.g., inflation, seasonality, economic policy, societal behaviors). Such factors can be analyzed over time for SPF. Machine learning and deep learning have been shown to produce better stock-price forecasts than traditional approaches. This study therefore proposed a method to enhance the performance of an SPF system based on advanced machine learning and deep learning approaches. First, we applied extreme gradient boosting as a feature-selection technique to extract important features from high-dimensional time-series data and remove redundant features. Then, we fed the selected features into a deep long short-term memory (LSTM) network to forecast stock prices. The deep LSTM network was used to reflect the temporal nature of the input time series and to fully exploit future contextual information, and its complex structure enables it to capture more of the stochasticity within stock prices. The method applies without modification to both stock and Forex data. Experimental results on a Forex dataset covering 2008–2018 showed that our approach outperformed the baseline autoregressive integrated moving average (ARIMA) approach with regard to mean absolute error, mean squared error, and root-mean-square error.

Stock-price forecasting (SPF) is an attractive and challenging research area in quantitative investing and time-series data analysis [

Many SPF approaches have been proposed in recent decades, such as traditional time-series analysis and forecasting [

To overcome the drawbacks of conventional SPF approaches, machine learning and deep learning have recently been introduced to analyze time-series data [

One study [

A study [

Others [

The present study proposes a method based on machine learning and deep learning to enhance the performance of SPF. We combined a feature-selection model based on extreme gradient boosting (XGBoost) with a deep learning model based on long short-term memory (LSTM). The XGBoost model automatically selects the most important features from a high-dimensional time-series dataset and discards redundant features. We then exploit the power of LSTM regression, using the features extracted by the XGBoost model to forecast stock prices. We compared the performance of our approach with that of the autoregressive integrated moving average (ARIMA) model using Forex data from 2008 to 2018. Our method was found to maintain generality when applied to both stock and Forex data.

Here, we introduce two approaches for SPF. An ARIMA model is used as a baseline for comparison with our approach.

ARIMA [ ] is a classical time-series model that combines an autoregressive (AR) component of order $p$, $d$-th-order differencing, and a moving-average (MA) component of order $q$. On the $d$-times-differenced series $y'_t$, the model is

$$y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q},$$

where $c$ is a constant, $\phi_1, \dots, \phi_p$ are the AR coefficients, $\theta_1, \dots, \theta_q$ are the MA coefficients, and $\varepsilon_t$ is a white-noise error term.

The parameters $(p, d, q)$ are chosen by examining the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the series and by comparing fitted candidate models with information criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
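As a small illustration of the differencing step (the "integrated" part of ARIMA), the sketch below applies first-order differencing to a price series and then inverts the transform to recover the original scale. The series values are made-up placeholders, not data from the study.

```python
# Minimal sketch of first-order differencing and its inverse, the
# "I" (integrated, d = 1) step of ARIMA; price values are illustrative.

def difference(series):
    """Return the first-differenced series y'_t = y_t - y_{t-1}."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]

def undifference(first_value, diffs):
    """Invert first-order differencing given the initial value."""
    out = [first_value]
    for d in diffs:
        out.append(out[-1] + d)
    return out

prices = [1.2931, 1.2945, 1.2946, 1.2947]
diffs = difference(prices)               # one element shorter than prices
restored = undifference(prices[0], diffs)
```

Forecasting is done on the (more nearly stationary) differenced series, and the inverse transform maps predictions back to price levels.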

We first applied extreme gradient boosting (XGBoost) as a feature-selection method to select the features of the high-dimensional time-series data that are most important for prediction and to discard redundant features. The selected features were then fed into the LSTM model to forecast stock prices.

XGBoost [

Consider a dataset with $n$ examples and $m$ features, $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}$. XGBoost predicts with an ensemble of $K$ regression trees,

$$\hat{y}_i = \sum_{k=1}^{K} f_k(\mathbf{x}_i),$$

where each $f_k$ maps an example to a leaf score. The model is trained by minimizing the regularized objective

$$\mathcal{L} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2,$$

where $l$ is a differentiable loss function, $T$ is the number of leaves in a tree, and $w_j$ represents the score on the $j$-th leaf.
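The selection step itself reduces to ranking features by importance and keeping the top ones. In practice the scores would come from a trained XGBoost model (e.g., `booster.get_score(importance_type="weight")` returns per-feature F-scores); the scores and feature names below are illustrative placeholders, so only the selection logic is shown.

```python
# Sketch of XGBoost-style feature selection: keep the k features with
# the highest importance scores. The importance dict here is a made-up
# stand-in for scores produced by a trained booster.

def select_top_features(importance, k):
    """Return the k feature names with the highest importance scores."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

importance = {"f0": 12.0, "f1": 95.0, "f2": 3.0, "f3": 57.0, "f4": 88.0}
top3 = select_top_features(importance, 3)  # ['f1', 'f4', 'f3']
```

Only the columns named in the returned list are retained for the downstream LSTM; the rest are treated as redundant.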

We fed the features selected by XGBoost into the LSTM model for SPF. The LSTM model is an extension of the recurrent neural network (RNN) that mitigates the vanishing-gradient problem. It captures contextual information within a sequence or series and can combine past and future contexts when producing an output. The model also operates on sequences of arbitrary length: it learns long-range dependencies in the inputs, extracts important features from them, and preserves that information over long periods.

Given a frame $x_t$ in the feature sequence $x_1, \dots, x_T$, each time the LSTM unit receives an input $x_t$, it updates the hidden state $h_t$ with a nonlinear function that takes both the current input $x_t$ and the previous hidden state $h_{t-1}$. Specifically, given the frame $x_t$ at the current time step, the previous hidden state $h_{t-1}$, and the previous cell state $c_{t-1}$, the unit computes the forget gate $f_t$, the input gate $i_t$, the output gate $o_t$, and the candidate context $\tilde{c}_t$:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f),$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i),$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o),$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c),$$

where the $W$, $U$, and $b$ terms are learned weights and biases and $\sigma$ is the logistic sigmoid. The cell state $c_t$ and hidden state $h_t$ at the current time step are then

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,$$
$$h_t = o_t \odot \tanh(c_t),$$

where $\odot$ denotes element-wise multiplication.
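The gate equations above can be traced with a toy single-unit LSTM cell. This is a pure-Python sketch with made-up scalar weights (hidden size 1), purely to make the update rule concrete; a real model would use learned weight matrices across many units.

```python
import math

# One-step scalar LSTM cell implementing the standard gate equations
# (forget, input, output gates and candidate context). All weights are
# illustrative constants, not trained parameters.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM update; p holds per-gate scalar weights (W, U, b)."""
    f = sigmoid(p["Wf"] * x_t + p["Uf"] * h_prev + p["bf"])   # forget gate
    i = sigmoid(p["Wi"] * x_t + p["Ui"] * h_prev + p["bi"])   # input gate
    o = sigmoid(p["Wo"] * x_t + p["Uo"] * h_prev + p["bo"])   # output gate
    c_tilde = math.tanh(p["Wc"] * x_t + p["Uc"] * h_prev + p["bc"])
    c_t = f * c_prev + i * c_tilde        # new cell state
    h_t = o * math.tanh(c_t)              # new hidden state
    return h_t, c_t

params = {k: 0.5 for k in
          ("Wf", "Uf", "bf", "Wi", "Ui", "bi", "Wo", "Uo", "bo", "Wc", "Uc", "bc")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.2]:                # a toy input sequence
    h, c = lstm_step(x, h, c, params)
```

Because the output gate multiplies a tanh of the cell state, the hidden state is always bounded in (−1, 1) regardless of the sequence fed in.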

We evaluated our proposed method using a dataset collected from the Forex market [

| | 5-min price dataset | 60-min price dataset |
|---|---|---|
| Number of observations | 709,314 | 59,094 |
| Mean value | 1.365315 | 1.365289 |
| Standard deviation | 0.082423 | 0.082418 |
| Min value | 1.188260 | 1.189927 |
| Median | 1.353260 | 1.353211 |
| Max value | 1.603050 | 1.600019 |

We randomly split the 60-min subdataset into two groups, approximately 70% for training and 30% for testing, to analyze the ARIMA model. Specifically, 41,365 observations were used as training data and 17,729 as test data. The training data were used to find the best parameters (p, d, q) of the ARIMA model.

| Lag | ACF | PACF |
|---|---|---|
| 1 | 0.2264735 | 0.2264773 |
| 2 | 0.0025956 | −0.0513291 |
| 3 | −0.0033345 | 0.0081084 |
| 4 | −0.0022325 | −0.0033139 |
| 5 | 0.0047621 | 0.0061770 |
| 6 | 0.0111998 | 0.0090789 |
| 7 | 0.0094809 | 0.0052901 |
| 8 | −0.0047576 | −0.0081162 |
| 9 | −0.0001455 | 0.0034540 |
| 10 | −0.0101618 | −0.0117976 |
| 11 | −0.0092337 | −0.0044169 |
| 12 | −0.0156961 | −0.0139487 |
| 13 | −0.0115826 | −0.0054923 |
| 14 | −0.0015148 | 0.0018332 |
| 15 | 0.0061270 | 0.0060157 |
| 16 | 0.0047040 | 0.0020550 |
| 17 | 0.0104383 | 0.0101022 |
| 18 | 0.0186073 | 0.0151472 |
| 19 | 0.0091663 | 0.0023393 |
| 20 | 0.0060597 | 0.0043159 |

| p | d | q | Log-likelihood | AIC | BIC |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 310104.521 | −620203.042 | −620176.081 |
| 1 | 1 | 1 | 310184.538 | −620361.076 | −620325.129 |
| 0 | 1 | 0 | 308548.840 | −617093.679 | −617093.679 |
| 0 | 1 | 1 | 310184.101 | −620362.202 | −620335.241 |
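The AIC column can be reproduced from the log-likelihoods via AIC = 2k − 2 ln L. The parameter count k = p + q + 2 (the AR and MA terms plus an intercept and the innovation variance) is our inference, but it matches the tabulated values to within rounding; the sketch below recomputes the AIC scores and picks the best order.

```python
# Recompute AIC = 2k - 2*logL for the candidate ARIMA orders in the
# table, with k = p + q + 2 (inferred parameter count: AR and MA
# coefficients plus intercept and innovation variance).

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

candidates = {            # (p, d, q): log-likelihood from the table
    (1, 1, 0): 310104.521,
    (1, 1, 1): 310184.538,
    (0, 1, 0): 308548.840,
    (0, 1, 1): 310184.101,
}
scores = {order: aic(ll, order[0] + order[2] + 2)
          for order, ll in candidates.items()}
best = min(scores, key=scores.get)     # (0, 1, 1) has the lowest AIC
```

Minimizing AIC (and BIC) here selects ARIMA(0, 1, 1) as the best-fitting candidate.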

In the XGBoost and LSTM approaches, we randomly split the original 5-min dataset into three groups: approximately 60% for training, 20% for validation, and 20% for testing. We used 397,216 observations for training, 170,235 for validation, and 141,863 for testing. Each observation was high-dimensional, with 200 features. We used XGBoost to rank feature importance by F-score and selected the most important features.
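The random 60/20/20 split described above can be sketched on observation indices as follows; the observation count and seed are illustrative, not the values used in the study.

```python
import random

# Random 60/20/20 train/validation/test split over observation
# indices, with a fixed seed for reproducibility (toy count used here).

def split_indices(n, train=0.6, val=0.2, seed=42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)           # reproducible shuffle
    n_train = int(n * train)
    n_val = int(n * val)
    return (idx[:n_train],                      # training set
            idx[n_train:n_train + n_val],       # validation set
            idx[n_train + n_val:])              # test set

train_idx, val_idx, test_idx = split_indices(1000)
```

The three index lists partition the observations, so no example appears in more than one group.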

Finally, we used the mean absolute error (MAE), mean squared error (MSE), and root-mean-square error (RMSE) as metrics to evaluate the accuracy of the SPF system; the lower these values, the more accurate the system.
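These three metrics are standard and can be stated compactly in code; the target/prediction values below are illustrative, not taken from the experiments.

```python
import math

# Minimal implementations of the three evaluation metrics used to
# score the forecasts; sample values are illustrative placeholders.

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

y_true = [1.2934, 1.2946, 1.2946, 1.2947]
y_pred = [1.2695, 1.2930, 1.2950, 1.2945]
errors = (mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

Note that RMSE is never smaller than MAE, and squaring makes MSE and RMSE penalize large single-step errors more heavily than MAE does.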

| Target | Predicted |
|---|---|
| 1.293415 | 1.269521 |
| 1.294596 | 1.292982 |
| 1.294630 | 1.294977 |
| 1.294735 | 1.294543 |
| 1.294765 | 1.294777 |
| 1.267252 | 1.267472 |
| 1.268075 | 1.267196 |
| 1.267995 | 1.268281 |
| 1.267935 | 1.267924 |
| 1.267882 | 1.267934 |
| 1.267662 | 1.267866 |
| 1.266994 | 1.267610 |

| Target | Predicted |
|---|---|
| 1.267860 | 1.267744 |
| 1.268110 | 1.268210 |
| 1.268230 | 1.268661 |
| 1.268460 | 1.268563 |
| 1.268660 | 1.268538 |
| 1.252120 | 1.251820 |
| 1.252720 | 1.252840 |
| 1.252480 | 1.252201 |
| 1.252190 | 1.252538 |
| 1.252190 | 1.252215 |
| 1.252210 | 1.251961 |
| 1.252200 | 1.252945 |
| 1.251690 | 1.252231 |

| Methods | MSE | MAE | RMSE |
|---|---|---|---|
| ARIMA | 6.114 × 10^{−7} | 4.149 × 10^{−4} | 7.819 × 10^{−4} |
| XGBoost + LSTM | 3.465 × 10^{−7} | 3.825 × 10^{−4} | 5.887 × 10^{−4} |

This study proposed an improved SPF system that combines XGBoost and LSTM models. We first used XGBoost as a feature-selection method to extract important features from a high-dimensional dataset. These features were then fed into deep LSTM models to evaluate the performance of the forecasting system. The experimental results verified that the proposed approach significantly improved the accuracy of the SPF system and outperformed the baseline ARIMA approach.