The growing number of COVID-19 cases puts pressure on healthcare services and public institutions worldwide. The pandemic has brought much uncertainty to the global economy and the situation in general. Forecasting methods and modeling techniques are important tools for governments to manage critical situations caused by pandemics, which have a negative impact on public health. The main purpose of this study is to obtain short-term forecasts of disease epidemiology that could be useful for policymakers and public institutions to make necessary short-term decisions. To evaluate the effectiveness of the proposed attention-based method, which combines certain data mining algorithms with the classical ARIMA model for short-term forecasts, data on the spread of the COVID-19 virus in Lithuania is used, the forecasts of epidemic dynamics are examined, and the results are presented. Nevertheless, the approach presented might be applied to any country and to other pandemic situations. The COVID-19 outbreak started at different times in different countries; hence, some countries have a longer history of the disease and more historical data than others. The paper proposes a novel approach to data registration and machine learning-based analysis that uses data from the countries selected by the attention-based method for forecast validation, in order to predict trends in the spread of COVID-19 and assess risks.

The COVID-19 pandemic has added a high degree of unpredictability to the global economy and the situation in general. Governments are trying to overcome the infection by taking serious measures in an effort to stabilize the situation. Experts are trying to predict how the situation may change, how it will look once the coronavirus is contained, and which states will be the first to emerge from the economic recession. At the moment, the priority is to solve urgent health care problems and maintain economic stability, but experts are already looking into the near future to understand how the disease rate may progress.

Short-term and long-term forecasting models are generally used to forecast certain situations and to alert us to events in the future so that we are better prepared. Short-term forecasting models [

Effective short-term prediction models are needed to predict the number of new cases. In this regard, it is important to develop strategic planning methods in the public health care system to avoid further increases in incidence of infection, as well as to introduce special measures to reduce the scope of infection. Various methods based on mathematical modeling and data mining are powerful tools for understanding the COVID-19 transmission [

Despite the limitations associated with medical data-based forecasting and the specific nature of the data being analyzed, forecasting plays an important role because it enables a better understanding of the current situation and supports planning for the future. Mathematical modeling and disease prediction are powerful tools for understanding the spread of COVID-19 and studying different scenarios. Various time series analysis methods are currently being used for short-term forecasting of COVID-19 epidemic dynamics: linear forecasting models, including the autoregressive integrated moving average (ARIMA) model [

The main purpose of this article is to provide short-term forecasts that are reliable enough for policymakers to make the necessary decisions and to provide useful guidance. In this paper, the authors present the results of statistical forecasting of confirmed cases of COVID-19 in Lithuania using the attention-based approach and ARIMA models. The proposed methods might be applied to any country and to other pandemic situations. The paper is organized as follows: Section 2 presents the analyzed data and the methodology for short-term forecasting of confirmed cases of COVID-19 using the ARIMA models and the attention-based machine learning method. Section 3 presents the experimental results of the proposed method and compares them with the forecasts for Lithuania obtained using the ARIMA models. Section 4 concludes the study.

During the pandemic, the source of Lithuanian COVID-19 data changed, which made the task of spread forecasting challenging. The main data provider in Lithuania is the National Public Health Center (NPHC) under the Ministry of Health. Since the beginning of the spread of COVID-19 in Lithuania, the data has been announced on the website of the Ministry of Health. The problem with this form of announcement was that no historical data was available, only the daily statistics; thus, the authors had to collect, process, and store data by creating their own database. Moreover, because of problems with data collection in the NPHC (delays by various institutions, errors caused by human factors, etc.), historical data was also revised without announcement. At first, the revisions were not substantial and did not affect the short-term forecasts. However, corrections became substantial before the autumn of 2020, when the number of confirmed cases increased. Since the end of August, the NPHC has been sharing files with the time series data. Although there were still problems with the quality of the data, historical data became available. Since November, the only institution announcing the COVID-19 data is Statistics Lithuania.^{1}

Nevertheless, it has been noticed that some time series, such as recoveries or active cases, still suffer from quality problems due to delays in reporting recoveries by hospitals. Thus, in the middle of February 2021, Statistics Lithuania started announcing two time series for recovered and active cases: a

The authors use data^{2}

Although the first confirmed case of COVID-19 in Lithuania dates from February 28, 2020, the authors use data on the spread of the disease in Lithuania for the period from March 12, 2020, to February 1, 2021, because only one case was registered until March 12, 2020. Moreover, we use the earlier definition of recovered cases, since the new definitions appeared after the period we are investigating.

For every day up to January 26, 2021, forecast models were built on 35 consecutive observations, and forecasts were made five steps ahead. As mentioned above, the virus spread followed different trends during the spring and autumn periods, so the authors built models for the spring sub-period using data from March 12 to June 30, 2020 (including the five-step-ahead forecast), and for the autumn sub-period using data from October 1, 2020, to January 31, 2021. We refer to these periods as the “first wave” and the “second wave”, though in epidemiological terms these waves may have different time stamps. The authors do not investigate other sub-periods, since the summer was fairly stable and calm in terms of the spread of COVID-19.
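The rolling scheme described above can be sketched as follows. This is a minimal illustration that assumes a plain Python list of daily values; the function name `rolling_windows` and its parameters are hypothetical and are not taken from the authors' code.

```python
def rolling_windows(series, window=35, horizon=5):
    """Yield (train, test) pairs from a daily series.

    Each pair holds `window` consecutive observations used for fitting
    and the next `horizon` observations used for checking the forecast.
    """
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        train = series[start:start + window]
        test = series[start + window:start + window + horizon]
        yield train, test


# Toy example: 45 days of data give 6 windows of 35 + 5 days.
toy_series = list(range(45))
toy_windows = list(rolling_windows(toy_series))
```

Shifting the window by one day per model is what produces one fitted model per calendar day in the evaluation period.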

Two approaches were used for the short-term forecasting of confirmed cases of COVID-19 in Lithuania: ARIMA models and the attention-based forecasting method. The results of ARIMA models and all the data used for modeling and forecasting are available at

ARIMA models are frequently used for short-term time series forecasting. Although models of this type are simple and easy to apply, they perform well for short-term forecasting. ARIMA models are best fitted to stationary data, or to differenced data if the data is stationary after differencing. In this paper, we apply the non-seasonal ARIMA(

To estimate the ARIMA model, the Hyndman-Khandakar algorithm [

The idea of the attention-based approach is to use a mechanism that selects specific factors from the data available. The idea can be accomplished by focusing attention on small regions of multidimensional information rather than on the data as a whole. In this way, dimensionality reduction techniques are used to draw attention to similar countries where the spread of the virus is similar and to further analyze the data for countries that fall into the attention cluster. The attention-based mechanism acts as an extractor of information while inferring the similarity and minimizing the number of countries used for spread forecasting.
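The selection idea can be illustrated with a much simpler stand-in than the clustering pipeline used in the paper. The sketch below ranks countries by plain Euclidean distance between feature vectors and keeps the closest ones; the distance measure, the function name `closest_countries`, and the toy feature values are all assumptions made for illustration, whereas the paper itself relies on SOM clustering inspected with MDS and t-SNE.

```python
import math


def closest_countries(features, target="Lithuania", k=6):
    """Return the k countries whose feature vectors are nearest to the target.

    `features` maps a country name to a list of numbers of equal length.
    Euclidean distance is an assumed similarity measure, not the authors'.
    """
    reference = features[target]
    distances = {
        name: math.dist(reference, vector)
        for name, vector in features.items()
        if name != target
    }
    return sorted(distances, key=distances.get)[:k]


# Hypothetical two-feature vectors, for illustration only.
toy_features = {
    "Lithuania": [1.0, 2.0],
    "A": [1.1, 2.1],
    "B": [5.0, 5.0],
    "C": [1.0, 2.5],
}
toy_selection = closest_countries(toy_features, k=2)
```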

The authors propose an attention-based forecasting method that consists of three steps:

1. Data registration: for the first wave, relative to the first confirmed case of COVID-19; for the second wave, relative to the first day after July 1, 2020, on which the number of confirmed cases per 100,000 population is greater than or equal to 3.
2. Selection of the countries most similar to Lithuania, using data mining and machine learning techniques.
3. Forecasting based on the trends in confirmed cases in the selected countries.

Each step of the method is described in more detail in the subsections below.

The outbreak of COVID-19 started at different times in different countries. Thus, some countries have a longer history and more historical data on the spread of COVID-19 than others. The novelty of the proposed method is to consider the onset of the spread of COVID-19 in different European countries, to compare the dynamics of the virus, and to integrate this knowledge into the forecast. The idea is to use data from countries with more historical disease data to forecast trends in Lithuania. For this purpose, we registered the data in such a way that the time series starts from the first confirmed case of COVID-19, i.e., we use an artificial time scale (number of days from the first confirmed case) rather than an actual calendar date for the spring sub-period (the first wave). For the autumn sub-period (the second wave), we took data from July 1, 2020, and registered the data so that the first day is the day when the number of confirmed cases per 100,000 population is greater than or equal to three. We chose a threshold value of three because, for many countries, the increase in confirmed cases in the second sub-period began at this point. For the short-term forecasting for the first wave, the countries selected were those with more historical data on the disease than Lithuania, counted from the first confirmed case of COVID-19: Austria, Belgium, Croatia, Denmark, Estonia, Finland, France, Germany, Greece, Iceland, Italy, Netherlands, North Macedonia, Norway, Romania, Spain, Sweden, Switzerland, and the United Kingdom. Accordingly, the following countries were identified in the same way for short-term forecasting in the case of the second wave: Austria, Belgium, Bulgaria, Croatia, Czechia, Denmark, France, Greece, Hungary, Ireland, Italy, Liechtenstein, Luxembourg, Malta, Montenegro, Netherlands, North Macedonia, Norway, Portugal, Romania, Serbia, Slovakia, Slovenia, Spain, Sweden, Switzerland, and the United Kingdom. Lithuanian data was also included for both waves.
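The registration step can be sketched in a few lines. The example below is a minimal illustration, assuming each country's series is a Python list of daily values per 100,000 population; the function name `register_series` and the toy numbers are hypothetical.

```python
def register_series(values, threshold=3.0):
    """Re-index a daily series so that day 0 is the first day on which
    the value is greater than or equal to `threshold`.

    Returns an empty list if the threshold is never reached.
    """
    for day, value in enumerate(values):
        if value >= threshold:
            return values[day:]
    return []


# Hypothetical second-wave series (cases per 100,000), not real data.
toy_raw = {
    "Country A": [0.5, 1.2, 3.0, 4.1, 6.3],
    "Country B": [2.9, 3.4, 5.0],
}
toy_registered = {name: register_series(v) for name, v in toy_raw.items()}
```

After registration, index 0 means the same thing for every country, so a country whose registered series is longer than Lithuania's supplies the extra days of history that the method exploits. For the first wave, the same function could be applied to case counts with a threshold of one confirmed case.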

The multidimensional data describe complex objects or phenomena characterized by many features. For better comprehension, it is useful to provide data in an easy-to-understand form: to define the structure of the data, relationships, and clusters. Multidimensional data visualization methods are used to provide data mining results in a more comprehensive form by drawing attention to similarities. The attention-based selection of the European Union countries for forecasting is performed by integrating multidimensional data clustering and data dimensionality reduction methods: self-organizing neural network (SOM), multidimensional scaling (MDS), and t-distributed stochastic neighbor embedding (t-SNE). The data was first clustered using the SOM neural network. For the clustering result inspection, visualization techniques such as MDS or t-SNE methods can be used. Different visualization techniques were chosen to validate the clustering results obtained by the SOM, using methods based on different operating principles. Dimensionality reduction methods transform the analyzed dataset from the

where

Typically,

The self-organizing neural network SOM was used for clustering of multidimensional data. SOM is a neural network-based method that is trained in an unsupervised way using competitive learning [

The MDS method is used to find a configuration of points in a space, usually Euclidean, where each point represents one of the objects or individuals, and the distances between pairs of points in the configuration match as well as possible the original dissimilarities between the pairs of objects or individuals [

The forecast uses the trends in the number of confirmed cases in the countries that belong to the same cluster as Lithuania and have more historical data on the disease, counted from the first confirmed case of COVID-19. Regression models with countries as covariates are considered. The forecast can extend as many days ahead as the history of confirmed cases in the countries of the same cluster exceeds that of Lithuania. However, some countries in the same cluster as Lithuania do not have a much longer history of confirmed cases than Lithuania. Thus, to obtain a forecast for the required number of steps ahead, ARIMA models (see Section 2.2) were used to forecast the number of confirmed cases in each country, and these forecasts were then employed in the regression analysis to obtain the forecast of confirmed cases in Lithuania.

To achieve the goal above, the linear regression with ARMA errors was used:

where

After the linear regression models were obtained for each country in the cluster, the forecast was calculated by taking the average of forecasts from these models:
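A minimal sketch of this averaging step is given below. It makes a deliberate simplification: ordinary least squares with a single covariate stands in for the linear regression with ARMA errors, and the future covariate values are assumed to be already known or forecast. The function names and toy numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def average_forecast(target, covariates, future_covariates):
    """Fit one regression per country and average the resulting forecasts.

    `covariates` maps country -> history aligned with `target`;
    `future_covariates` maps country -> values for the forecast horizon.
    """
    forecasts = []
    for name, history in covariates.items():
        intercept, slope = fit_line(history, target)
        forecasts.append([intercept + slope * v for v in future_covariates[name]])
    horizon = len(forecasts[0])
    return [sum(f[h] for f in forecasts) / len(forecasts) for h in range(horizon)]


# Toy data: the target is exactly 2 * A and exactly 1 * B.
toy_target = [2.0, 4.0, 6.0, 8.0]
toy_cov = {"A": [1.0, 2.0, 3.0, 4.0], "B": [2.0, 4.0, 6.0, 8.0]}
toy_future = {"A": [5.0, 6.0], "B": [10.0, 12.0]}
toy_avg = average_forecast(toy_target, toy_cov, toy_future)
```

Averaging gives every country in the cluster equal weight; weighting by distance to Lithuania would be a different design and is not claimed here.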

The comparison of the accuracy of the forecasts over the considered time period was made by choosing models for every interval of 35 days and forecasting five steps ahead. The following measures of the forecast accuracy were used:

We estimate an ARIMA model for cumulative confirmed cases for every day, using the 35 most recent observations, over the period March 12, 2020–January 26, 2021, divided into two sub-periods: March 12–June 25, 2020, and October 1, 2020–January 26, 2021. Earlier data and the summer period are excluded because of the very small number of confirmed cases.

We would like to point out that modeling started at the end of March, and at first different types of models and different lengths of data were used. Here we present the final version of the forecasting approach; thus, the historical forecasts in the paper might differ from the forecasts announced to the public.

Note that the model was built on the cumulative number of confirmed cases per 100,000 people. In addition, the authors use the non-seasonal ARIMA model for the daily data, although slight seasonality might be observed because of the weekend effect, when fewer tests are performed. However, in testing, seasonal models did not perform better than non-seasonal models.

A new ARIMA model is fitted for every day. The orders of each ARIMA model are shown in

For every day, a new model is built, forecasts are made five steps ahead, and 80% and 95% prediction intervals are computed.

The complete algorithm for every day model is given in Algorithm 1 presented in

The result of Algorithm 1 is the graph where the black line and black dots indicate predicted values, the red dots indicate true values, the dark blue band indicates the 80% prediction interval, and the light blue band indicates the 95% prediction interval (see

As mentioned earlier, the historical data changed over time, and the prediction scheme also changed slightly over the pandemic period. Thus, a retrospective analysis of the prediction quality of the ARIMA models was carried out. To this end, training set errors and prediction errors computed over the whole period, taking into account all historical models, were investigated (see

In total, 156 models (72 for the first wave and 84 for the second wave) were estimated, and the training set errors and prediction errors were saved. The final output of Algorithm 2 consists of two tables.

Model | RMSE | MAE | MAPE |
---|---|---|---|
1 | 0.32 | 0.24 | 3.53 |
2 | 0.32 | 0.24 | 3.15 |
3 | 0.34 | 0.26 | 6.65 |
… | … | … | … |
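The three error measures in the table can be computed as sketched below. The sketch assumes the standard textbook definitions of RMSE, MAE, and MAPE, with MAPE expressed in percent; the function names and toy numbers are illustrative.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Toy values, not data from the study.
toy_actual = [100.0, 200.0]
toy_predicted = [110.0, 190.0]
toy_errors = (
    rmse(toy_actual, toy_predicted),
    mae(toy_actual, toy_predicted),
    mape(toy_actual, toy_predicted),
)
```

MAPE is scale-free, which is why it can differ noticeably between models whose RMSE and MAE are almost identical.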

The result of a

The cumulative RMSE, MAE, and MAPE errors showed that the values of the median and variance increased with each prediction step (see

The same algorithms were applied to the second period of data mentioned in Section 2.1: October 1, 2020–January 31, 2021.

The results obtained are very similar, but the errors are slightly larger. A bigger difference appears in the empirical probability that the true value lies in the prediction interval. Note that the 80% and 95% prediction intervals were computed at forecast time; retrospectively, Algorithm 2 records an indicator of whether the true value fell within each interval. Empirical prediction probabilities were computed as follows:

Prediction step | First wave, 80% P.I. | First wave, 95% P.I. | Second wave, 80% P.I. | Second wave, 95% P.I. |
---|---|---|---|---|
1 | 0.90 | 0.93 | 0.67 | 0.86 |
2 | 0.83 | 0.99 | 0.58 | 0.80 |
3 | 0.79 | 0.99 | 0.60 | 0.81 |
4 | 0.79 | 0.93 | 0.68 | 0.80 |
5 | 0.75 | 0.89 | 0.69 | 0.83 |
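The empirical probabilities reported in the table can be read as the share of forecasts whose prediction interval contained the observed value. A minimal sketch, assuming the interval bounds and true values for one prediction step are stored in parallel lists (the function name and toy numbers are hypothetical):

```python
def empirical_coverage(actuals, lowers, uppers):
    """Fraction of cases in which lower <= actual <= upper."""
    hits = [
        1 if low <= value <= high else 0
        for value, low, high in zip(actuals, lowers, uppers)
    ]
    return sum(hits) / len(hits)


# Toy values: two of the four intervals contain the observed value.
toy_actuals = [10.0, 12.0, 15.0, 20.0]
toy_lowers = [9.0, 13.0, 14.0, 18.0]
toy_uppers = [11.0, 14.0, 16.0, 19.0]
toy_coverage = empirical_coverage(toy_actuals, toy_lowers, toy_uppers)
```

A well-calibrated 80% interval should give a value close to 0.80; values well below the nominal level indicate intervals that are too narrow.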

The results obtained using attention-based forecasting are presented in this section. The complete scheme for attention-based forecasting is given in Algorithm 4 (see

Following the data registration described in Section 2.3.1, 20 European countries with a longer history of the disease and virus spread than Lithuania were selected for the study: 17 European Union countries, the United Kingdom, Norway, and North Macedonia. Daily cumulative relative data (per 100,000 people per day) for the last 35 days of the research is used: cumulative new confirmed cases, deaths, recovered cases, and population density. As mentioned above, the multidimensional data consists of 106 features per country. Following the registration of data relative to the first case (see Section 2.3.1), a cluster and visual analysis of the multidimensional data was carried out (see Section 2.3.2). To forecast the number of cases and disease trends in Lithuania, six countries were identified that have more historical disease data and whose registered data for the past 35 days follow a trend similar to that of Lithuania.

Rank | March | April | May | June |
---|---|---|---|---|
1 | Croatia | Croatia | Croatia | Croatia |
2 | North Macedonia | Romania | Finland | Greece |
3 | Greece | North Macedonia | Romania | Finland |
4 | Romania | Sweden | North Macedonia | Romania |
5 | Sweden | Greece | Greece | Estonia |
6 | Finland | Finland | Estonia | Norway |

However, from November onward, the situation began to change.

Rank | September | October | November | December | January |
---|---|---|---|---|---|
1 | Bulgaria | Spain | Slovakia | Slovenia | Czechia |
2 | Sweden | North Macedonia | Italy | Austria | Slovenia |
3 | Ireland | Montenegro | Czechia | Belgium | Slovakia |
4 | Croatia | Romania | Spain | Hungary | Montenegro |
5 | Norway | Hungary | Hungary | Slovakia | Croatia |
6 | Romania | Slovakia | Montenegro | Italy | Austria |

The approach outlined in Section 2.3.3 and the results presented in Section 3.2.1 were used to obtain the forecasts. Six countries from the same cluster and closest to Lithuania were used to make forecasts based on the number of cumulative confirmed cases. The same time intervals were considered to compare the results with those of the ARIMA models: the first is March 12–June 30, 2020 (corresponds to the first wave of COVID-19), and the second is October 1, 2020–January 31, 2021 (corresponds to the second wave).

Algorithm 4 was applied to obtain results, and the graph presents the final result (see

Comparing the results of ARIMA models and attention-based forecasting (see, for example,

All models, as in the case of ARIMA models, were estimated historically (overall 156 models were fitted: 72 in the case of the first wave and 84 in the case of the second wave), training set errors were saved and prediction errors over the two considered sub-periods were obtained (see

The comparison of the results obtained by the attention-based method for the two sub-periods (waves) shows that the forecast errors in the case of the first wave (see

The empirical probabilities that the true value is in the prediction region were obtained (see

Prediction step | First wave, 80% P.I. | First wave, 95% P.I. | Second wave, 80% P.I. | Second wave, 95% P.I. |
---|---|---|---|---|
1 | 0.86 | 0.96 | 0.62 | 0.81 |
2 | 0.89 | 0.99 | 0.61 | 0.78 |
3 | 0.85 | 0.99 | 0.58 | 0.77 |
4 | 0.88 | 0.97 | 0.56 | 0.75 |
5 | 0.88 | 0.97 | 0.51 | 0.70 |

This study investigates methods for obtaining short-term forecasts of the COVID-19 virus spread that can be useful for policymakers and public institutions in making necessary decisions and providing useful guidance as to what might happen in the coming week. By borrowing the idea of the attention-based approach from Long Short-Term Memory deep neural networks and combining this approach with the mathematical modeling and prediction methods, we obtain a powerful tool for understanding the spread of COVID-19 and exploring different short-term spread scenarios.

Two approaches were used for short-term forecasting of the confirmed COVID-19 cases in Lithuania: the ARIMA model and a new attention-based forecasting method, which combines machine learning techniques and statistical methods. The novelty of the approach presented above is the use of data from countries with a longer history of the disease to forecast trends in Lithuania. To this end, the authors introduce data registration from the first confirmed case of COVID-19. This way of registering data, together with integrated data analysis using clustering and dimensionality reduction techniques, makes it possible to assess trends in the spread of the virus in different countries and to group the countries according to similarity, i.e., to draw attention to those countries where the spread of COVID-19 behaves in a similar way. Moreover, the proposed approach makes it possible to assess the dynamics of the spread of the virus and changes in the situation over time. The clustering analysis shows the specificity of the virus spread and makes it possible to review the measures applied in the countries of the same cluster to control the virus and to assess the impact (effectiveness) of these measures on the increase in the number of newly confirmed cases of the disease. The attention-based identification of countries that are similar to the country under investigation, i.e., Lithuania, but have a longer history of virus spread, together with the forecast based on their trends, makes it possible to create and foresee virus spread scenarios based on the historical data of other countries.

Summarizing the results of the forecasting, it can be concluded that both methods demonstrate similar accuracy in forecasting COVID-19 cases during the so-called first wave (March 12–June 30, 2020); neither method outperforms the other. The forecast accuracy of the attention-based method for the second wave (October 1, 2020–January 31, 2021) is slightly lower than that of ARIMA. A possible explanation is that the situation and trends of confirmed cases in the countries are rather different, the number of countries in the same cluster as Lithuania is not large, and their distances from Lithuania within the cluster vary. However, the attention-based forecasting approach gives promising results. Higher forecast accuracy is achieved in periods when the countries in the same cluster are closer to Lithuania. The two approaches discussed above complement each other, provide insights for the short-term forecasting of COVID-19 spread, and make it possible to validate the forecasting results. The dimensionality reduction techniques, viewed as an attention-based method for selecting countries or regions with a similar COVID-19 spread, combined with regression analysis, provide a means to validate the forecasting results. The approach presented in the paper can be applied to any country and used to analyze other pandemic situations.

The authors are thankful for the high-performance computing resources provided by the Information Technology Open Access Center at the Faculty of Mathematics and Informatics of Vilnius University Information Technology Research Center.