Accurate forecasting of time series is crucial across various domains. Many prediction tasks rely on effectively segmenting, matching, and aligning time series data. For instance, even for time series with the same granularity, segmenting them into events of different granularities can effectively mitigate the impact of varying time scales on prediction accuracy. However, events of varying granularity frequently intersect with each other and may have unequal durations. Even minor differences can result in significant errors when matching time series with future trends. Besides, directly using matched but unaligned events as state vectors in machine learning-based prediction models can lead to insufficient prediction accuracy. Therefore, this paper proposes a short-term forecasting method for time series based on multi-granularity events, MGE-SP (multi-granularity event-based short-term prediction). First, a methodological framework for MGE-SP is established to guide the implementation steps. The framework consists of three key steps: multi-granularity event matching based on the LTF (latest time first) strategy, multi-granularity event alignment using a piecewise aggregate approximation based on the compression ratio, and a short-term prediction model based on XGBoost. The method is first evaluated on data from a nationwide online car-hailing service in China. On this dataset, the average RMSE (root mean square error) and MAE (mean absolute error) of the proposed method are 3.204 and 2.360, lower than the respective values of 4.056 and 3.101 obtained using the ARIMA (autoregressive integrated moving average) method, as well as the values of 4.278 and 2.994 obtained using the k-means-SVR (support vector regression) method. A second experiment is conducted on stock data from a public dataset. There, the proposed method achieved an average RMSE and MAE of 0.836 and 0.696, lower than the respective values of 1.019 and 0.844 obtained using the ARIMA method, as well as the values of 1.350 and 1.172 obtained using the k-means-SVR method.

Time series data, a collection of data recorded at specific intervals over a given period, are prevalent in various domains, such as finance [

Recurring and significant time series patterns are called events. Some traditional time series prediction methods use all of the time series data to construct models. This is not only inefficient but also makes it impossible to quickly track short-term changes in trends. Moreover, these methods almost always require the data to be stationary, i.e., they assume a linear model with linear correlations within the data, which is difficult to fit effectively to non-stationary data with high volatility. Therefore, it is necessary to detect events of various granularities to improve the matching degree. These events often have varying lengths and do not have an “either/or” relationship. Instead, one event can include or partially overlap another. Even minor differences between events can cause significant errors when matching them with future trends. Therefore, it is crucial to effectively match events of different granularities, which is a challenging task in time series analysis. Traditional methods have limitations in matching multi-granularity events, resulting in ineffective model fitting and a decline in prediction performance. Furthermore, once events have been matched, they cannot be used directly as state vectors to construct machine learning-based prediction models because their durations may differ. Therefore, dimension alignment is also necessary after matching the events, which is another challenging task in time series analysis.

In summary, the innovations and contributions of our work include the following five points:

1. A methodological framework, MGE-SP (multi-granularity event-based short-term prediction), is proposed to guide the implementation steps of our method. This framework incorporates our previous work on multi-granularity event detection based on self-adaptive segmenting [

2. A method of multi-granularity event matching that uses the LTF (Latest Time First) strategy is proposed. This method enables the matching of a real-time time series with multi-granularity events.

3. A method of time series alignment that uses piecewise aggregate approximation based on compression ratio is proposed. This method allows for the alignment of multi-granularity events.

4. A time series prediction model based on XGBoost is constructed. In this model, the aligned event instances serve as the state vectors for training the prediction model.

5. To demonstrate the universality of our method, experiments are conducted using two datasets from different domains: a customized passenger transport dataset and a stock dataset. The experiments involve comparing our method with traditional models such as ARIMA and k-means-SVR using the entire dataset.

The paper is organized as follows. In

In the field of time series analysis, time series prediction is an important research topic. Predicting the data trend can help users make reasonable decisions and plans, which is widely used in finance [

Time series prediction methods can be roughly classified into two categories: linear and nonlinear prediction. Traditional linear prediction methods, such as ARMA (autoregressive moving average) [

The above models are based on the assumption that there is a linear relationship between the historical data and the current data of the time series. However, for time series data with nonlinear characteristics, the linear prediction model is difficult to fit effectively. Due to the limitations of linear prediction methods, machine learning-based prediction methods, such as the hidden Markov model (HMM) [

A neural network is a supervised machine learning method that can effectively represent the mapping relationship between all kinds of nonlinear data, including time series, and solve nonlinear problems in many fields. BPNN (back propagation neural network) and RBFNN (radial basis function neural network) [

The key to machine learning-based prediction methods is the construction of a state vector. At present, most methods slide a fixed-length window over the time series. The window divides the time series into several equal-length segmentations, which are then clustered and stored in a pattern database. Pattern matching is then performed on the real-time data, and the matched pattern is used as the state vector to construct the prediction model. Therefore, the choice of the sliding window has a considerable impact on the performance of this type of method. When the time series contains multiple patterns of different lengths, pattern matching becomes more difficult.
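The fixed-window construction described above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation of any cited method: the function names `sliding_windows` and `to_training_pairs` and the toy `flow` values are hypothetical, and the clustering step that builds the pattern database is omitted.

```python
from typing import List, Sequence, Tuple


def sliding_windows(series: Sequence[float], window: int, step: int = 1) -> List[List[float]]:
    """Cut a series into equal-length segments with a fixed-length window."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [list(series[i:i + window])
            for i in range(0, len(series) - window + 1, step)]


def to_training_pairs(series: Sequence[float], window: int) -> List[Tuple[List[float], float]]:
    """Pair each window (the state vector) with the value that follows it."""
    return [(list(series[i:i + window]), series[i + window])
            for i in range(len(series) - window)]


# Toy passenger-flow values (hypothetical).
flow = [3, 5, 8, 6, 4, 7, 9]
windows = sliding_windows(flow, window=3)
pairs = to_training_pairs(flow, window=3)
```

Every state vector produced this way has the same length, which is the limitation discussed above: a pattern that is longer or shorter than `window` cannot be represented without cutting it or mixing it with neighbouring values.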

A time series is a sequence of data values that occur in successive order over a specific time period. In this regard, the concept of a time series can be defined as follows.

$T = (t_1, \ldots, t_n)$ represents a sequence of data values with a length of $n$, where $t_i$ is the data value at time $i$.

The time series should be segmented before it is used. If it is used directly to construct a prediction model, the consumption of computing resources will be high, and the noise in the time series may impair the model’s ability to capture the trend. The definition of segmentation is given below.

_{1},…, _{n}), if a start position _{i}, _{i+1},…, _{j}) is called a segmentation. The time granularity of _{j}. _{i}.

According to the trend of a time series, edge points are used to divide the time series into different segmentations. To measure all the possible edge points, an evaluation criterion, i.e., edge amplitude is introduced, which represents the trend difference between any two adjacent segmentations, denoted as _{1}, _{2}, on both sides of a point _{i}. If the edge amplitude of _{i} is larger, the point is more likely to be labeled as an edge point. As shown in

In the context of a time series, an event refers to a significant incident that manifests itself as a pattern within the time series data. For example, a traffic block can be identified as an event, which is depicted as a subsequence within a time series representing bus speeds. This study establishes a correlation between events and the occurrence of frequent patterns within a time series. The higher the frequency of a pattern appearing in a time series, the greater the probability of it being classified as an event. The subsequent definition outlines the characteristics of an event.

_{1},…, _{n}), a pattern

In an event set

_{1},…, _{m}) and e2 = (_{1}´,…, _{n}´), there are three possible relationships between them: 1) for any _{i} in e1 and _{j}´ in e2, _{i} ≠ _{j}´; 2) there are subsequences (_{i},…, _{i+l}) in e1 and subsequences (_{j}´,…, _{j+l}´) in e2, (_{i},…, _{i+l}) = (_{j}´,…, _{j+l}´); and in 3) there is a subsequence (_{1}´,…, _{m}´) in e2, _{1},…, _{m}) = (_{1}´,…, _{m}´), and the events in set

Given a time series _{1},…, _{n}) and a multi-granularity event set

To solve the problem, a systematic approach involving several steps is necessary: detecting, matching, and aligning multi-granularity events, and then predicting the time series. To facilitate this data analysis process, a methodological system, the multi-granularity event-based short-term prediction (MGE-SP) framework, is proposed, as shown in

In the phase of event detection, segmenting a time series is the first step. By symbolizing the time series and applying a self-adaptive segmenting algorithm, a historical time series is partitioned into multi-granularity events. The multi-granularity event detection method based on self-adaptive segmenting [

In the phase of event matching, a real-time time series, which is symbolized, will be matched with the events in the multi-granularity event database. The method of event matching is different depending on different requirements. In this paper, an LTF (Latest Time First) strategy is proposed to match multi-granularity events.

In the phase of event aligning, event instances with different time scales are aligned to an equal length, so that they can be used as a state vector to construct a machine learning-based prediction model. A CRPAA (compression-ratio-based piecewise aggregate approximation) method is proposed to align the multi-granularity events. Of course, other aligning methods can also be employed in this phase if there is a special requirement to be met.

In the phase of short-term time series prediction, an XGBoost-based prediction model is constructed. The aligned event instances are used as state vectors to construct the short-term prediction model of the time series.

For most machine learning-based prediction methods, the selection of the state vector is one of the important factors that determine the performance of the prediction model. In most traditional methods, a sliding window is used to segment a time series, and the similarity distance between the segmentations is calculated to match time series patterns. The prediction model is then constructed from these patterns. However, if the source time series patterns are all truncated to equal length, event matching degrades, and the model ultimately cannot be fitted effectively. To solve this problem, a method of multi-granularity event matching based on the LTF (Latest Time First) strategy is proposed. The main idea of the strategy is as follows.

Given a symbolic representation of a real-time time series _{1},…, _{n}), set _{h}) = (_{n-h+1},…, _{n}) the latest subsequence of _{h}) with length _{h}) is matched layer by layer. If there is a node _{i,j} in _{i,j}._{h}), a matched event is found, and _{i,j}._{i,j} cannot be matched, the symbol farthest from the current time in _{h}) is removed. A new subsequence _{h−1}) is constructed. The process is repeated until successful matching is achieved. If the event tree _{1}), the nearest symbol in the

An event tree is created as a suffix tree, in which the root node only serves to index the child nodes. A non-root node _{i,j} stores two kinds of information: a subsequence, denoted _{i,j.sequence}, whose frequency is greater than _{i,j.startindex}, in the symbol sequence. More details of the event tree can be found in previous work [

To match a symbol sequence

Step 1: Match the subsequence _{h}) by searching the tree

Step 2: Remove the farthest symbol in the _{h}) from the current time, i.e., _{h}) is not zero, a new subsequence _{h}) is constructed, and return to step 1;

Step 3: Determine that the symbol subsequence _{h}) is an exception if there is no successful match to be found in _{h}). Remove it from

Step 4: Output the matched node. To match a new symbol subsequence, return to step 1.

The pseudocode of the matching symbol sequence is shown in Algorithm 1.
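As a rough illustration of the LTF idea in the steps above, the following Python sketch tries the longest latest subsequence first and drops the symbol farthest from the current time after each failed attempt. It is a simplification under stated assumptions: a plain dictionary (`events`, hypothetical) stands in for the event tree, and the names `ltf_match`, `E1`, `E2`, and `E3` are invented for the example.

```python
from typing import Dict, Optional, Sequence, Tuple

Symbol = str


def ltf_match(symbols: Sequence[Symbol],
              events: Dict[Tuple[Symbol, ...], str],
              h: int) -> Optional[Tuple[str, Tuple[Symbol, ...]]]:
    """Match the latest subsequence against known events, newest symbols first.

    Starts from the last `h` symbols and, on failure, drops the symbol
    farthest from the current time (the oldest one) before retrying.
    Returns (event_id, matched_subsequence), or None if nothing matches.
    """
    h = min(h, len(symbols))
    while h > 0:
        candidate = tuple(symbols[len(symbols) - h:])
        if candidate in events:
            return events[candidate], candidate
        h -= 1  # shrink from the old end; the newest symbols are kept
    return None


# Hypothetical event database: symbol sequence -> event identifier.
events = {("a", "b", "c"): "E1", ("b", "c"): "E2", ("c",): "E3"}

full = ltf_match(list("dabc"), events, h=4)   # falls back from length 4 to 3
short = ltf_match(list("xbc"), events, h=3)   # falls back from length 3 to 2
none = ltf_match(list("xyz"), events, h=3)    # no match: treated as an exception
```

A dictionary lookup costs one hash per candidate length, whereas the event tree in the text shares prefixes between events; the sketch only mirrors the order in which candidates are tried, not the data structure.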

To ensure the timeliness of matching events, an event rematch is required following new time series data. If the last node that has already matched is _{i,j}, suppose that the symbolized representation of the new data is (_{k+1},…, _{k+m}), and a new match of subsequence (_{k+1},…, _{k+m}) starts. The root node of the event tree is _{i,j}. If (_{k+1},…, _{k+m}) can successfully be matched at some node in the event tree, the node is the solution. For example, given a symbol sequence _{1,1}, _{1,2}, and _{1,3} using symbol _{1,3} stores a subsequence of _{1,3} is selected. The second step is to match each of the child nodes, _{2,6} and _{2,7} of node _{1,3} using the first two symbols of the subsequence (_{2,7} stores the subsequence (_{2,7}.

The process of matching a real-time time series using the LTF strategy can obtain all event instances in historical multi-granularity events. However, the duration of event instances may not be the same. For example, all event instances of an event discovered from the dataset, ToeSegmentation1 (

To serve this purpose, an event-aligning method CRPAA (compression-ratio-based piecewise aggregate approximation) is proposed. For a given set of event instances, CRPAA divides all event instances into an equal number of segmentations. Then, each segmentation is approximated by its mean value so that every event instance is mapped into the same low-dimensional space.

Many PLR (piecewise linear representation) methods [_{1},…, _{n}) and a piecewise linear function _{i} is the

The compression ratio is often used to evaluate the performance of PLR algorithms. The compression ratio refers to the ratio of the sequence length before and after compression. Given a time series _{1},…, _{n}), if a new sequence of length

The higher the compression ratio is, the lower the length of the sequence compressed by the PLR algorithm, and the better the compression effect. Given a time series _{1},…, _{n}) and a time window of length _{i} is calculated by

The sequence obtained by using PAA to compress the original sequence can provide a tight lower bound distance measurement, which can effectively control the error with the original time series and provide a better representation effect [

Suppose that _{h}) is the symbolic representation of the latest subsequence in a real-time time series, where _{1}, _{2},…, _{n}. For any event instance _{i} with a length of _{i}, _{i} is denoted as _{i,j} is the _{h}) can be calculated by _{i} is divided into a new approximation of the segmentation sequence _{i,j}, there is at least one symbol corresponding to an event instance _{i}. Finally, event instance set {_{1}, _{2}, …, _{n}} can be approximated as segmentation sequences

The pseudocode of CRPAA is shown in Algorithm 2.
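The alignment idea, dividing every event instance into the same number of segmentations and replacing each one by its mean, can be sketched as follows. This is a hedged illustration rather than Algorithm 2 itself: the compression ratio is assumed to be `1 - k / n`, the common length `k` is assumed to be derived from the shortest instance so that every segmentation holds at least one value, and the names `paa` and `crpaa` are hypothetical.

```python
from typing import List, Sequence


def paa(values: Sequence[float], k: int) -> List[float]:
    """Piecewise aggregate approximation: k near-equal segments, each replaced by its mean."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(values)")
    means = []
    for s in range(k):
        chunk = values[s * n // k:(s + 1) * n // k]  # never empty because k <= n
        means.append(sum(chunk) / len(chunk))
    return means


def crpaa(instances: Sequence[Sequence[float]], ratio: float) -> List[List[float]]:
    """Align instances of different lengths to one common length.

    Assumption: the target length is taken from the shortest instance,
    k = max(1, round(shortest * (1 - ratio))).
    """
    if not instances:
        return []
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must be in [0, 1)")
    shortest = min(len(x) for x in instances)
    k = max(1, round(shortest * (1.0 - ratio)))
    return [paa(x, k) for x in instances]


# Three event instances with different durations (toy values).
instances = [
    [1, 2, 3, 4],
    [2, 4, 6, 8, 10, 12],
    [5, 5, 7, 7, 9, 9, 11, 11],
]
aligned = crpaa(instances, ratio=0.5)
```

After alignment, all three instances have length 2 and can be stacked into a feature matrix, which is what makes them usable as state vectors for a tabular learner.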

The performance of CRPAA depends on the compression ratio. The lower the compression ratio is, the more information is retained and the better the original time series is characterized, but the aligned vectors then suffer from the curse of dimensionality. In contrast, a higher compression ratio avoids the curse of dimensionality and speeds up model fitting, but it loses more information and degrades the fit. Therefore, the key to improving CRPAA performance is finding the balance point between maximizing dimensionality reduction and retaining trend information. In time series analysis, a fitting error [

Given a time series _{1},…, _{n}), using CRPAA,

For a set of event instances {_{1}, _{2}, …, _{n}}, using CRPAA, if all the event instances are compressed, the cumulative fitting error is calculated by

For example, CRPAA is used to compress the event instances obtained from the dataset ToeSegmentation1, as shown in

Based on the analysis, a compression ratio search algorithm is proposed. The cumulative fitting error is calculated iteratively by setting an initial compression ratio of 50%, rising by increments of 5% each time to 95%. All the cumulative fitting errors are sorted into a sequence. While searching the sequence using the method in [
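The iteration over candidate compression ratios can be sketched as below. Two points are assumptions, because the corresponding formulas are not reproduced here: the fitting error is taken to be the squared error between each value and the mean of its segmentation, and the segment count is derived from the shortest instance. The function names are hypothetical, and the final selection rule is deliberately left out because it relies on the cited method.

```python
from typing import Dict, Iterable, Sequence


def paa_fit_error(values: Sequence[float], k: int) -> float:
    """Squared error between a sequence and its k-segment mean approximation (assumed metric)."""
    n = len(values)
    error = 0.0
    for s in range(k):
        chunk = values[s * n // k:(s + 1) * n // k]
        mean = sum(chunk) / len(chunk)
        error += sum((v - mean) ** 2 for v in chunk)
    return error


def cumulative_errors(instances: Sequence[Sequence[float]],
                      ratios_percent: Iterable[int] = range(50, 100, 5)) -> Dict[int, float]:
    """Cumulative fitting error for each candidate ratio: 50%, 55%, ..., 95% by default."""
    shortest = min(len(x) for x in instances)
    result = {}
    for r in ratios_percent:
        # Clamp k so that every segment of every instance holds at least one value.
        k = min(shortest, max(1, round(shortest * (100 - r) / 100)))
        result[r] = sum(paa_fit_error(x, k) for x in instances)
    return result


instances = [[1, 2, 3, 4], [2, 4, 6, 8]]
errors = cumulative_errors(instances)
```

The resulting dictionary is the error sequence that a selection rule would then search for the balance point between compression and retained trend information.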

Based on the process of dimension alignment, XGBoost is used to construct the short-term prediction model of the time series. XGBoost is a boosting model based on a decision tree, which has advantages, such as fast training speed and a good fitting effect [

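For a given dataset with $n$ samples, where each sample $x_i$ corresponds to a real-valued label $y_i$, the prediction model of XGBoost consisting of $K$ weak classifiers is represented by

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$$

where $f_k$ is the $k$-th weak classifier and $\hat{y}_i$ is the predicted value for $x_i$.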

The objective function of XGBoost is constructed by a loss function plus a penalty function of the regularisation term, as shown in _{i} and the prediction value _{j} is the weight of the

XGBoost is essentially an ensemble method that uses residuals to construct multiple weak classifiers, which work together to make the final decision. Therefore, the objective function can be further represented as follows in

To construct the short-term prediction model of the time series, the aligned event instances are used as state vectors to train the model. The training framework for the short-term prediction model is shown in

The Bayesian optimizer constructs a posterior model based on the performance of the available samples on the optimization objective function. In each iteration, the next configuration to evaluate is chosen according to the previously observed performance, and the posterior probability model is continuously updated to search for locally optimal parameters of the objective function. As the entire machine learning framework has a large parameter space, Bayesian optimization is slow to start, so a meta-learning approach is used to warm-start it: many configurations are selected based on meta-learning, and their results are used to seed Bayesian optimization. The whole machine learning process is highly automated and can save users much time.

The automated training framework based on the Bayesian optimizer is used to optimize the parameters of XGBoost. There are three steps altogether:

1. Align the event instances with which a real-time time series is successfully matched, together with the future observations corresponding to the event instances, using CRPAA. Then, the event instances are input into the training framework as true values.

2. Initialise the Bayesian optimizer using meta-learning.

3. Predict future observations using the parameters of the current model to evaluate the difference between the true value and the predicted value and to optimize the parameters using the Bayesian optimizer according to the difference.

To evaluate the performance of MGE-SP, we conducted two experiments, one on customized passenger transport data from Xiamen, China, and one on stock data. Moreover, for a quantitative analysis, two methods, the ARIMA method [

The dataset is collected from an online ride-hailing service company in Xiamen, China. We extracted some of the orders from June to December 2019 (_{i} in a time series T = (_{1},…, _{n}) is a statistic on the number of passengers on board at different periods. The period is usually half an hour.

OrderID | Boarding time | Num | Boarding position (longitude, latitude) | Get-off position (longitude, latitude) |
---|---|---|---|---|
XX1089 | 2019/6/1 6:40 | 1 | (119.7892229, 25.510116) | (119.3058661, 26.08521756) |
XX7605 | 2019/6/1 9:27 | 1 | (119.270257, 26.08194055) | (119.8009559, 25.5116319) |
XX0693 | 2019/6/1 6:30 | 2 | (119.800145, 25.510018) | (119.3093007, 26.08156558) |
XX9713 | 2019/6/1 20:25 | 1 | (119.3266432, 26.11967494) | (119.8015237, 25.50211781) |
XX0123 | 2019/6/1 18:14 | 2 | (119.806641, 25.49598875) | (119.2573591, 26.03864469) |
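To make the construction of the passenger-flow series concrete, the sketch below aggregates order records into half-hour slots. It assumes that the `Num` column is the number of passengers in an order and that each slot is keyed by its start time; the function name `half_hour_flow` is hypothetical, and only the boarding time and `Num` values of the sample orders above are used.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def half_hour_flow(orders: Iterable[Tuple[str, int]]) -> Dict[datetime, int]:
    """Sum passenger counts per 30-minute slot, keyed by the slot start time."""
    flow: Counter = Counter()
    for boarding_time, num in orders:
        t = datetime.strptime(boarding_time, "%Y/%m/%d %H:%M")
        # Round the boarding time down to :00 or :30.
        slot = t.replace(minute=0 if t.minute < 30 else 30, second=0, microsecond=0)
        flow[slot] += num
    return dict(sorted(flow.items()))


# (boarding time, Num) pairs taken from the sample orders.
orders = [
    ("2019/6/1 6:40", 1),
    ("2019/6/1 9:27", 1),
    ("2019/6/1 6:30", 2),
    ("2019/6/1 20:25", 1),
    ("2019/6/1 18:14", 2),
]
flow = half_hour_flow(orders)
```

Slots without any order are simply absent from the result; a complete series for modelling would also need those slots filled with zero.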

To enable a better experimental comparison, the time series of passenger flow is divided into two parts. All data except those from the last 5 days (December 27, 2019 to December 31, 2019) are used as the training set. The data of the last 5 days are used as the testing set and compared with the prediction results, as shown in

The experiment has three steps: 1) the time series in the last five days are matched with the multi-granularity events using the LTF strategy; 2) the real-time data and the event instances are aligned using CRPAA; and 3) the XGBoost model for short-term prediction is constructed, during which the dimension-aligned event instances are used as the state vectors.

Before the three steps, the multi-granularity events should be resolved using the SSED method, which was proposed in [

Step 1) The time series in the last five days are matched with the multi-granularity events using the LTF strategy.

After symbolizing the data of the prediction day, the real-time sequence of 30 min is matched with the event instances based on the LTF strategy. To better simulate the real situation, the event matching is restarted every 30 min.

Step 2) The real-time data and the event instances are aligned using CRPAA.

After extracting the event instances with which the real-time sequence is successfully matched, the CRPAA algorithm is used to align the dimensions of the event instances. As the main parameter of CRPAA is the compression ratio, Algorithm 3 is used to search for the optimal solution for the compression ratio.

Step 3) The XGBoost model for short-term prediction is constructed, during which the dimension-aligned event instances are used as the state vectors.

The event instances processed in step 2 are used as the state vector to construct the short-term prediction model based on XGBoost. In this paper, the grid search algorithm is used to automatically optimize the parameters involved in XGBoost [

For quantitative analysis, two methods, the ARIMA method proposed in [

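In this paper, MAE (mean absolute error) and RMSE (root mean square error) are used to evaluate the prediction effect:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ represents the actual value at time $i$, $\hat{y}_i$ represents the predicted value at time $i$, and $n$ is the number of predicted points.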
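Both metrics follow directly from their standard definitions, as the short Python sketch below shows. The function names and the toy `actual` and `predicted` values are hypothetical and are not taken from the experiments.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


actual = [10, 12, 9, 11]
predicted = [9, 14, 9, 12]
mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
```

Because RMSE squares each error before averaging, a single large deviation raises it more than it raises MAE, so RMSE is never smaller than MAE on the same predictions.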

Fundamentally, MAE and RMSE measure the same quantity: the average magnitude of the error between the true value and the predicted value. The lower the MAE and RMSE values are, the better the model’s prediction. The comparison of the MAE and RMSE of the three methods is shown in

Groups of experiments | RMSE (ARIMA) | RMSE (k-means-SVR) | RMSE (MGE-SP) | MAE (ARIMA) | MAE (k-means-SVR) | MAE (MGE-SP) |
---|---|---|---|---|---|---|
December 27, 2019 | 3.97453 | 3.81564 | | 3.13030 | 2.77939 | |
December 28, 2019 | 4.25032 | 4.78262 | | 3.18051 | 3.13520 | |
December 29, 2019 | 3.16465 | 3.59357 | | 2.53261 | 2.67083 | |
December 30, 2019 | 4.34110 | 4.16940 | | 3.39810 | 2.99211 | |
December 31, 2019 | 4.55266 | 5.02953 | | 3.26447 | 3.39269 | |
Average | 4.05665 | 4.27815 | 3.204 | 3.10119 | 2.99404 | 2.360 |

K-means-SVR had the highest RMSE of the three methods on December 28, December 29, and December 31. K-means-SVR matches patterns using a sliding window and clustering. However, the passenger flow sequence contains events of multiple granularities. It is difficult to accurately identify the start time and duration of multi-granularity events through a fixed time window, which is what most pattern recognition methods use. This leads to misjudged patterns and makes the predicted value deviate too much from the true value.

MGE-SP has achieved the lowest error in the last five days. Compared with the other two models, MGE-SP tracks the trend of passenger flow more accurately, fits the true value better, and obtains the predicted value with the lowest deviation from the true value. MGE-SP matches events through the state vector represented by the symbol sequence of real-time data. The method breaks through the limitations of traditional methods that can only match events with fixed lengths.

The stock dataset is collected from the big data platform, Tushare (

The closing prices of the four stocks over the selected time frame are shown in

The differences between the predicted value and the true value of the stock closing price by MGE-SP, ARIMA, and k-means-SVR are shown in

Stock | RMSE (ARIMA) | RMSE (k-means-SVR) | RMSE (MGE-SP) | MAE (ARIMA) | MAE (k-means-SVR) | MAE (MGE-SP) |
---|---|---|---|---|---|---|
Ping An Bank | 0.53580 | 0.79928 | | 0.43003 | 0.54644 | |
Vanke | 0.67195 | 0.68777 | | 0.51178 | 0.51483 | |
ZTE | 2.70651 | 3.76730 | | 2.32380 | 3.52561 | |
OCT | 0.16272 | 0.14688 | | 0.10886 | 0.10061 | |
Average | 1.01924 | 1.35030 | 0.836 | 0.84361 | 1.17187 | 0.696 |

In our research, we have focused on the analysis of multi-granularity event matching and alignment methods in the computer domain. We have integrated these methods with previously studied multi-granularity event segmentation techniques to create a comprehensive system named MGE-SP. This system facilitates the preprocessing of time series data and significantly improves the accuracy of short-term predictions. Our primary objective is to leverage MGE-SP to enhance the precision of our forecasting models.

We evaluated MGE-SP with two indicators, RMSE and MAE. In the experiments, MGE-SP achieved lower average RMSE and MAE than the other methods: 3.204 and 2.360 in the first experiment and 0.836 and 0.696 in the second. The results show that MGE-SP outperforms the traditional methods on data from two different domains, which supports the universality of MGE-SP.

Our research contributes novel ideas and approaches to the domain of multi-granularity event matching and alignment. Specifically, we have developed the LTF strategy and algorithm, which effectively decreases matching errors arising from equal-length time series patterns. Additionally, we employ the piecewise aggregate approximation method, which leverages compression ratio to align the time scales of multi-granularity events. These aligned events can serve as state vectors for machine learning-based prediction methodologies.

Although MGE-SP can effectively reduce the prediction error of time series, the matching of large-scale multi-granularity events comes at the cost of sacrificing matching efficiency. Therefore, future work will focus on enhancing the efficiency of matching multi-granularity events. This may involve considering events as high-order features or exploring the potential of incorporating both high-order and low-order features in the matching process.

Thanks to Jiang Peizhou from the Xiamen GNSS Development & Application Co., Ltd. He provided the necessary scientific data for the work. Thanks to Dr. Mengxia Liang from Harbin Institute of Technology, who provided many valuable suggestions for the revision of this article.

This research was funded by the Fujian Province Science and Technology Plan, China (Grant Number 2019H0017).

The authors confirm contribution to the paper as follows: study conception and design: Haibo Li; data collection: Haibo Li; analysis and interpretation of results: Yongbo Yu, Zhenbo Zhao and Xiaokang Tang; draft manuscript preparation: Haibo Li, Yongbo Yu. All authors reviewed the results and approved the final version of the manuscript.

The data comes from publicly available datasets on Tushare (

The authors declare that they have no conflicts of interest to report regarding the present study.