Frequent missing values in radar-derived time-series tracks of aerial targets (RTT-AT) pose significant challenges for subsequent data-driven tasks. However, the majority of imputation research focuses on random missing (RM), which differs significantly from the missing patterns common in RTT-AT; methods designed for RM may therefore suffer performance degradation or outright failure when applied to RTT-AT imputation. Conventional autoregressive deep learning methods are also prone to error accumulation and loss of long-term dependencies. In this paper, a non-autoregressive imputation model is proposed that addresses missing value imputation for two common missing patterns in RTT-AT. Our model consists of two probabilistic sparse diagonal masking self-attention (PSDMSA) units and a weight fusion unit. It learns missing values by combining the representations output by the two units, aiming to minimize the difference between the imputed values and their actual values. The PSDMSA units effectively capture temporal dependencies and attribute correlations between time steps, improving imputation quality. The weight fusion unit automatically updates the weights of the output representations from the two units to obtain a more accurate final representation. The experimental results indicate that, across varying missing rates in the two missing patterns, our model consistently outperforms other methods in imputation performance and rarely produces large deviations in its estimates for specific missing entries. Compared to the state-of-the-art autoregressive deep learning imputation model Bidirectional Recurrent Imputation for Time Series (BRITS), our proposed model reduces mean absolute error (MAE) by 31%–50%. Additionally, the model trains 4 to 8 times faster than both BRITS and a standard Transformer model on the same dataset. Finally, the ablation experiments demonstrate that the PSDMSA units, the weight fusion unit, the cascaded network design, and the imputation loss all enhance imputation performance, confirming the efficacy of our design.

In combat situational awareness and understanding, the air battlefield has been the focal point of attention. The aerial target is a crucial element in the air battlefield situation. The radar early warning system detects and tracks targets, generating and storing abundant time-series track data in the information processing center encompassing a wealth of general information and latent knowledge. Currently, data mining on radar-derived time-series tracks of aerial targets (RTT-AT) has provided favorable support in target behavior pattern analysis [

However, missing values are pervasive in RTT-AT due to radar terminal malfunction, packet loss during data transmission, and storage loss [

Intuitively, the imputation of missing values in RTT-AT can be formulated as a problem of multivariate time series imputation (MTSI), which involves estimating missing values in a time series with multiple attributes. In the domain of MTSI, using the statistical properties of either a single sample or the entire dataset to impute missing values is a classical approach, including techniques such as mean imputation and median imputation. Extensive research has demonstrated that this approach can lead to considerable bias [

In recent years, data-driven deep learning techniques have garnered significant attention and have been extensively applied in various domains, showcasing their robust capabilities and vast potential. Specifically, these techniques have demonstrated remarkable achievements in computer vision, natural language processing, bioinformatics, and beyond [

GRNN-based imputation methods can be categorized into two distinct classes: unidirectional and bidirectional GRNN-based imputation models [

The typical missing patterns that occur in RTT-AT are referred to as random missing at the altitude attribute (RM-ALT) and random missing at all attributes (RM-ALL). Due to differences in their generation mechanisms, the two differ significantly in the distribution of missing values from the random missing (RM) pattern that is the concern of general time series imputation work. Specifically, under the RM pattern, missing values occur randomly at any time in any attribute; in RM-ALT, missing values occur randomly at any time but only in the altitude attribute; and in RM-ALL, missing values appear randomly at any time, with all attributes unobserved at that moment.

Hence, when traditional imputation methods designed for RM are directly applied to the two missing patterns in RTT-AT, their underlying assumptions no longer hold, potentially leading to increased data bias. In addition, conventional deep learning imputation methods based on GRNNs are vulnerable to cumulative errors and long-term dependency loss due to their network structure, leading to performance bottlenecks. The self-attention mechanism, being data-driven and non-autoregressive, can address these issues. This paper proposes a model to impute missing values in RTT-AT based on an improved self-attention network. The contributions include:

(1) This paper formulates RTT-AT data imputation as a multivariate time series imputation problem. A novel non-autoregressive imputation model that minimizes imputation loss is proposed to tackle the typical missing patterns. The imputation loss enables the training of the non-autoregressive model, focusing it on learning the missing values rather than entirely reconstructing the observations. Non-autoregressive models trained to minimize imputation loss perform significantly better than models (whether autoregressive or non-autoregressive) trained to minimize reconstruction loss.

(2) The designed non-autoregressive imputation model comprises two cascaded probabilistic sparse diagonal masking self-attention (PSDMSA) units and a weight fusion unit. In PSDMSA, probabilistic sparsity is introduced to emphasize the prominent dot-product pairs while decreasing computational complexity. Moreover, diagonal masking directs attention to the temporal dependencies and attribute correlations between time steps. The fusion unit automatically learns the weights of the two units’ outputs by considering temporal and attribute correlations together with the missing information, yielding a better representation. This design improves the model’s ability to capture missing patterns and enhances imputation quality with fewer parameters and reduced training time.

(3) A series of comparison and ablation experiments are conducted to evaluate the effectiveness of the proposed model. The experimental findings indicate that the method outperforms other approaches in imputing missing values of RTT-AT regardless of the missing rate. Additionally, while the method’s performance decreases as the missing rate increases, the decrease is relatively limited. The ablation experiments demonstrate the efficiency and effectiveness of each design component. The structure of the following sections is as follows:

In this section, the definitions and formal descriptions of multivariate time series and time series missing value imputation, related work on time series missing value imputation, and the typical missing patterns in RTT-AT are presented in

A time series is a collection of values obtained from measurements over a continuous period of time. Formally, a collection of timestamps

The binary missing mask matrix

The aim of time series missing value imputation is to employ the imputation model to substitute every missing component of

In general, based on the difference in the number of samples used, time series missing value imputation methods can be divided into two types: in-sample methods and cross-sample methods. The former uses information from the current sample only, while the latter uses information from the current sample and other samples during the training or testing phase.

Imputation methods based on statistical characteristics are simple and easy to apply. The most commonly used statistical characteristics are the mean value [

The matrix-decomposition-based imputation methods use a matrix/tensor approach to impute missing values in multivariate time series. Literature [

Nearest-neighbor-based methods estimate missing values by finding the nearest neighbors (determined by a chosen distance criterion). The method involves finding the nearest neighbors of the missing value through the other attributes and then updating the missing value with the average of these nearest neighbors. Among these methods, some exploit local similarity by substituting the last observed value for the missing value [
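To make the mechanism concrete, the following is a minimal NumPy sketch of such a nearest-neighbor scheme; the function name `knn_impute_row`, the Euclidean distance over attributes observed in both rows, and `k = 5` are illustrative assumptions rather than any cited author's implementation.

```python
import numpy as np

def knn_impute_row(X, i, d, k=5):
    """Fill X[i, d] with the mean of attribute d over the k rows closest to
    row i, where distance is measured on attributes observed in both rows."""
    obs = ~np.isnan(X[:, d])                  # rows where attribute d is observed
    cand = np.where(obs)[0]
    cand = cand[cand != i]
    if len(cand) == 0:
        return np.nan                         # no donor rows available
    dists = []
    for j in cand:
        shared = ~np.isnan(X[i]) & ~np.isnan(X[j])  # attributes observed in both rows
        shared[d] = False                     # exclude the target attribute itself
        dists.append(np.linalg.norm(X[i, shared] - X[j, shared])
                     if shared.any() else np.inf)
    nearest = cand[np.argsort(dists)[:k]]     # indices of the k nearest donors
    return X[nearest, d].mean()
```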

The expectation maximization (EM) algorithm consists of two steps. In the E-step, the expectation of the complete data sufficient statistics is calculated based on the observations and current parameter estimates. In the M-step, the parameter estimates are updated using the maximum likelihood method. The algorithm iterates until the convergence criteria are met. Based on the final parameter estimates and observations, the imputation of each missing value can be calculated. Literature [
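As an illustration of these two steps, the sketch below implements a simplified EM imputation under a multivariate Gaussian assumption. It re-estimates the mean and covariance from the filled-in data and omits the conditional-covariance correction of the exact algorithm, so it is a didactic approximation, not the method of the cited literature.

```python
import numpy as np

def em_impute(X, n_iter=50, tol=1e-6):
    """Simplified EM imputation; rows of X are treated as i.i.d. multivariate
    Gaussian samples and np.nan marks missing entries."""
    X = X.copy()
    miss = np.isnan(X)
    X[miss] = np.take(np.nanmean(X, axis=0), np.where(miss)[1])  # init: column means
    for _ in range(n_iter):
        # M-step (approximate): re-estimate parameters from the filled data
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
        X_old = X.copy()
        for i in range(X.shape[0]):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                X[i, m] = mu[m]               # fully missing row: use the mean
                continue
            # E-step: conditional expectation of missing given observed values
            X[i, m] = mu[m] + sigma[np.ix_(m, o)] @ np.linalg.solve(
                sigma[np.ix_(o, o)], X[i, o] - mu[o])
        if np.max(np.abs(X - X_old)) < tol:   # convergence criterion
            break
    return X
```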

Constraint-based methods are used to discover rules in the dataset and estimate missing values. In time series imputation, similarity rules, such as differential dependence or comparable dependence, are commonly used to study the distance and similarity between timestamps as well as values [

Many researchers have applied machine learning methods to time series missing value imputation, inspired by their outstanding performance in time series identification and prediction. Literature [

Recently, deep learning techniques capable of mining deep hidden features have been utilized for time series missing value imputation. Recurrent neural networks are commonly used in these methods due to their ability to handle sequential information. Currently, the two most representative methods are M-RNN and BRITS. M-RNN is a multi-directional recurrent neural network that interpolates within data streams and imputes across data streams. Experimental results on five real-world medical datasets show that M-RNN significantly improves the quality of missing value estimates and performs robustly [

According to Little and Rubin’s definition, there are three types of missing generation mechanisms in MTS [

1. Missing Completely at Random (MCAR), where the probability of a target attribute being missing is not influenced by any observed or unobserved attributes; in simpler terms, the missingness occurs entirely at random.

2. Missing at Random (MAR), where the probability of the target attribute being missing is related to the observed attributes but not to the unobserved ones.

3. Missing not at Random (MNAR) refers to scenarios where the likelihood of a target attribute being missing is connected to unobserved attributes, indicating a non-random characteristic in the missingness.

Moreover, attributes refer to the characteristics or features of an MTS. When all the missing values pertain to the same attribute, the missingness is called single-valued missing (SM). If the missing values belong to different attributes, it is called arbitrary missing (AM).

This paper centers on RTT-AT acquired and stored by early warning detection radar terminals. Detection radars can be divided into height finder radars, two-dimensional radars, and three-dimensional radars based on the number of dimensions of target position information they can detect. The two-dimensional radar determines aerial targets’ bearing and distance and, working with the height finder radar, can additionally attain their altitude. The three-dimensional radar acquires aerial targets’ bearing, distance, and altitude simultaneously. While the three-dimensional radar offers remarkable functional integration and high performance, it comes at significant cost. Conversely, the inexpensive and easy-to-operate two-dimensional radar and height finder radar remain the most extensively employed in early warning and detection. Therefore, the scenario of coordinated detection by a two-dimensional radar and a height finder radar is considered, where the two-dimensional radar typically acquires the target’s azimuth and distance first, while the height finder radar obtains the altitude. After pre-processing operations such as coordinate transformation, the radar-derived trajectory data has three attributes: longitude, latitude, and altitude. In this cooperative working mode, the following two typical missing patterns are considered, as shown in

The missing pattern typically addressed by general time series imputation methods is random missing (RM), in which the missing values occur randomly and are uncorrelated among all attributes. The two missing patterns in RTT-AT exhibit distinct characteristics from RM. In RM-ALT, missing values occur randomly at any moment but only in the altitude attribute, never in the other attributes. In RM-ALL, missing values occur randomly at any moment, and missingness in the altitude attribute is correlated with missingness in the other attributes: whenever a value is missing, the values of all attributes at that moment are unobserved. A summary of the above three modes is shown in

| Missing pattern | Missing mechanism | Missing distribution |
| --- | --- | --- |
| RM | MCAR | AM |
| RM-ALT | MCAR | SM |
| RM-ALL | MNAR | AM |

The methodology consists of two parts: (1) the missing value imputation workflows presented in

The missing value imputation workflows for RTT-AT are shown in

During the data processing phase, the raw multivariate time-series radar track data is first partitioned into training, validation, and testing sets according to a predefined split ratio. Subsequently, a standardization operation is applied to the data within each subset. This step is crucial as it eliminates disparities in measurement units and scales across attributes, ultimately improving the model’s imputation accuracy. Standardization also improves the model’s convergence, reducing potential oscillations caused by disparities in data scales during optimization. The process of standardization can be represented as follows:
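Assuming the standard per-attribute z-score transform (each attribute rescaled to zero mean and unit standard deviation, with statistics taken from the training set, consistent with the dataset description later in the paper):

$$x'_{t,d} = \frac{x_{t,d} - \mu_d}{\sigma_d}$$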

In the equation above, $x_{t,d}$ denotes the value of attribute $d$ at time step $t$, and $\mu_d$ and $\sigma_d$ denote the mean and standard deviation of attribute $d$.

Radar-derived track time series are long and vary considerably in length, which makes them challenging to process directly with deep learning models. To address this issue, a non-overlapping sliding window of fixed length is used to segment the track data. This segmentation ensures that the resulting sliced dataset has uniform sample lengths, making it amenable to deep learning-based processing. The precise operational procedure of this slicing method is illustrated in
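A minimal sketch of this slicing step follows, assuming (per the experiments section) a window length of 500 and that trailing remainders shorter than the window are dropped:

```python
import numpy as np

def slice_tracks(tracks, window=500):
    """Cut variable-length tracks into non-overlapping, fixed-length samples.
    tracks: list of (T_i, D) arrays; any tail shorter than `window` is dropped."""
    samples = []
    for track in tracks:
        for k in range(len(track) // window):
            samples.append(track[k * window:(k + 1) * window])
    return np.stack(samples)  # (N, window, D)
```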

Subsequently, observations are randomly masked to generate incomplete samples with missing values at the predetermined missing rate. A corresponding mask matrix is produced for these missing values according to
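The following is a minimal sketch of this masking step for the three patterns summarized above; the function name and the assumption that altitude is the last attribute column are illustrative.

```python
import numpy as np

def make_mask(shape, pattern, rate, seed=None):
    """Generate a binary observation mask (1 = observed, 0 = missing) for a
    (T, D) sample with D = 3 attributes (longitude, latitude, altitude)."""
    rng = np.random.default_rng(seed)
    T, D = shape
    mask = np.ones((T, D), dtype=np.int8)
    if pattern == "RM":         # missing independently at any time, any attribute
        mask[rng.random((T, D)) < rate] = 0
    elif pattern == "RM-ALT":   # missing only in the altitude attribute
        mask[rng.random(T) < rate, D - 1] = 0
    elif pattern == "RM-ALL":   # whole time steps dropped across all attributes
        mask[rng.random(T) < rate, :] = 0
    else:
        raise ValueError(pattern)
    return mask
```

The incomplete sample fed to the model is then the element-wise product of the complete sample and the mask, with the mask itself also provided as input.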

During the imputation model learning phase, imputing missing values in RTT-AT is treated as a supervised learning problem: the model learns relationships among observed data points so as to minimize the loss between missing values and their predicted values while capturing the missing patterns. The training set is used to develop the imputation model, calculate the loss, and update parameters through optimization guided by a specific loss function. The validation set serves a dual purpose: tuning the model’s hyperparameters and providing an initial evaluation of its performance. Following training, the test set samples are used to comprehensively evaluate the trained model’s imputation capability and generalizability with the relevant evaluation metrics.

To delve into more specific details, let us examine the notation and learning objective used in our imputation model. The unaltered complete sample before artificial masking is denoted as

where

The designed network comprises two cascaded PSDMSA units and a weight fusion unit, as depicted in

The vanilla self-attention mechanism was initially proposed by Vaswani et al. and has subsequently found extensive application in natural language processing, computer vision, and sequence modeling [
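In its standard scaled dot-product form,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$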

where

The attention scores produced by vanilla self-attention follow a long-tailed distribution: dot-product operations on queries and keys show that most dot-product pairs contribute little, while only a few pairs play a dominant role. Therefore, to emphasize the prominent dot-product pairs and simultaneously decrease computational complexity, probabilistic sparsity is introduced into the vanilla self-attention mechanism of the imputation model [

In

(1) PSDMSA evaluates the importance of queries based on the Kullback-Leibler divergence before performing the dot-product operation and selects only the top-

(2) The elements on the diagonal of the attention map are set to negative infinity before the softmax, so that each time step cannot attend to itself; its estimate must instead be inferred from the remaining time steps, which forces greater attention to the temporal dependencies and attribute correlations between time steps.

The specific process is shown in

1: …
2: set the sample scores …
3: compute sparsity score …
4: set top-…
5: compute …
6: …
7: update …
8: …
9: set …
10: set …
11: set PSDMSA feature map …
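Since parts of the listing above are elided, the following is a minimal single-head PyTorch sketch of the two PSDMSA modifications, under stated assumptions: the sparsity measure is computed from the full score matrix for clarity (the sub-quadratic variant estimates it from sampled keys), `factor` stands in for the importance factor listed in the hyperparameter table, and lazy queries fall back to the mean of the values, as in ProbSparse attention.

```python
import math
import torch
import torch.nn.functional as F

def psdmsa(Q, K, V, factor=5):
    """Sketch of probabilistic sparse diagonal masking self-attention (one head).
    Q, K, V: (T, d) tensors. Returns the (T, d) feature map."""
    T, d = Q.shape
    scores = Q @ K.T / math.sqrt(d)                # (T, T) dot-product scores
    # Diagonal masking: a time step may not attend to itself, so its value
    # must be reconstructed from the other time steps.
    scores.fill_diagonal_(float("-inf"))
    # Query sparsity measure (max minus mean of each score row): a max-mean
    # proxy for the KL-divergence-based importance described above.
    off_diag_mean = scores.masked_fill(torch.isinf(scores), 0.0).sum(-1) / (T - 1)
    sparsity = scores.max(dim=-1).values - off_diag_mean
    u = min(T, max(1, int(factor * math.log(T))))  # number of active queries
    top = sparsity.topk(u).indices
    # Lazy queries output the mean of V; active queries use full attention.
    out = V.mean(dim=0, keepdim=True).expand(T, -1).clone()
    out[top] = F.softmax(scores[top], dim=-1) @ V
    return out
```

In the model, this operation is wrapped into the usual multi-head form with learned projections, as described next.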

Subsequently, to enhance information extraction efficiency, PSDMSA is expanded to multi-head PSDMSA:

where

The initial Transformer design does not itself encode temporal order, so Vaswani et al. introduced positional embedding to provide positional information to the Transformer’s input vectors [
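Assuming the standard sinusoidal formulation from the original Transformer:

$$\mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right), \qquad \mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right)$$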

where

The feed-forward network consists of two cascaded fully connected layers and a ReLU function as shown in
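Assuming the standard position-wise form, which matches the two-layer ReLU description above:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$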

where

In the first unit, as shown in

In the second unit, the feature vector obtained by concatenating the

The fusion unit’s function is to automatically learn the weights of
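Although the unit's exact parameterization is not fully reproduced here, one plausible PyTorch realization consistent with the description in the contributions is a gate conditioned on both PSDMSA representations and the missing mask that produces element-wise mixing weights. The class name `WeightFusion` and the linear-plus-sigmoid gate below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeightFusion(nn.Module):
    """Sketch of a weight fusion unit: a learned gate decides, per time step
    and feature, how to mix the two PSDMSA representations."""
    def __init__(self, d_model, n_attrs):
        super().__init__()
        self.gate = nn.Linear(2 * d_model + n_attrs, d_model)

    def forward(self, h1, h2, mask):
        # h1, h2: (B, T, d_model) outputs of the first and second PSDMSA units
        # mask:   (B, T, n_attrs) binary observation mask (missing information)
        g = torch.sigmoid(self.gate(torch.cat([h1, h2, mask], dim=-1)))
        return g * h1 + (1.0 - g) * h2   # element-wise convex combination
```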

In this study, the data come from complete RTT-AT records for the same area acquired and stored by a radar unit in an air situation simulation system. The unit comprises a two-dimensional radar and a height finder radar. The records cover six typical target activities: reconnaissance, police patrol, refueling, AWACS, attack, and retreat.

The dataset consists of 12,000 time-series tracks with a sampling frequency of 10 Hz. Track lengths range from 1052 to 8534 time steps. After coordinate transformation, each time-series track is represented by three attributes: longitude (in degrees), latitude (in degrees), and altitude (in km).

A set of preprocessing procedures is applied to prepare the initial data for training the imputation model, as outlined below. First, data slicing is performed, dividing all track data into non-overlapping segments of length 500. This slicing yields a total sample size of 166,000.

Next, all data samples are standardized by attribute, ensuring each attribute has a mean of 0 and a standard deviation of 1. This step creates the standardized sample set, essential for maintaining consistent scales across different attributes during model training.

Lastly, the missing values of various patterns and rates are introduced into the standardized sample set, forming two distinct datasets:

where

Samples in

In this paper, three widely used metrics are utilized to assess the imputation performance of our model: mean absolute error (MAE), root mean square error (RMSE), and mean relative error (MRE). These metrics comprehensively evaluate imputation accuracy by calculating the differences between actual missing values and their imputed counterparts. Specifically,
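for the set of imputed entries $\Omega$, with ground-truth values $x_j$ and imputed values $\hat{x}_j$, the standard masked definitions are assumed here (the sum-normalized MRE is the form commonly used in imputation work):

$$\mathrm{MAE} = \frac{1}{|\Omega|}\sum_{j \in \Omega}\left|\hat{x}_j - x_j\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{|\Omega|}\sum_{j \in \Omega}\left(\hat{x}_j - x_j\right)^2}, \qquad \mathrm{MRE} = \frac{\sum_{j \in \Omega}\left|\hat{x}_j - x_j\right|}{\sum_{j \in \Omega}\left|x_j\right|}.$$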

A batch size of 32 is used for model training, validation, and testing, which strikes a balance between computational efficiency and generalization performance. An early stopping mechanism that monitors the training loss is employed to prevent overfitting and ensure timely convergence. Training is halted, and the best model parameters are saved if the loss does not decrease after 100 consecutive epochs, indicating potential stagnation in the learning process. The chosen optimizer for our model is Adam. Specific hyperparameters associated with the Adam optimizer, such as the learning rate, are detailed in

| Hyperparameter | Value |
| --- | --- |
| Batch size | 32 |
| Number of multi-head PSDMSA units | 2 |
| Number of multi-head PSDMSA blocks per unit | 1 |
| Hidden size of feed-forward network | 512 |
| Model hidden dimension | 64 |
| Number of heads in multi-head PSDMSA | 8 |
| Dimensions of Q, K, and V | 64, 64, 64 |
| Importance factor in PSDMSA | 5 |
| Epoch | 10000 |
| Learning rate | 0.005 |
| Optimizer | Adam |

To ensure consistent and reproducible results, all experiments were conducted on a single computer platform. The platform was equipped with an Intel i5 10200 CPU, 32 GB of memory, and an NVIDIA GeForce RTX 2060 GPU, providing a balanced computational environment for our research. The deep learning framework used is PyTorch.

In

(1) Zero imputation (Zero): imputing missing values with zero; (2) Mean imputation (Mean) [

The results of different imputation methods for RM-ALT at varying missing rates appear in

Values in each cell are MAE/RMSE/MRE.

| Method | 5% | 10% | 15% | 20% |
| --- | --- | --- | --- | --- |
| Zero | 0.8439/1.0022/1.0000 | 0.8453/1.0031/1.0000 | 0.8427/1.0000/1.0000 | 0.8426/0.9997/1.0000 |
| Mean | 0.0548/0.1902/0.0650 | 0.0551/0.1908/0.0652 | 0.0553/0.1909/0.0657 | 0.0548/0.1902/0.0650 |
| Median | 0.0516/0.2056/0.0611 | 0.0519/0.2055/0.0614 | 0.0519/0.2050/0.0616 | 0.0471/0.1284/0.0563 |
| KNN | 0.0234/0.0595/0.0277 | 0.0241/0.0622/0.0285 | 0.0246/0.0640/0.0292 | 0.0247/0.0573/0.0295 |
| RF | 0.0179/0.0427/0.0212 | 0.0181/0.0437/0.0214 | 0.0180/0.0426/0.0214 | 0.0181/0.0431/0.0215 |
| M-RNN | 0.0235/0.0561/0.0279 | 0.0243/0.0571/0.0288 | 0.0275/0.0620/0.0326 | 0.0274/0.0650/0.0326 |
| BRITS | 0.0210/0.0547/0.0249 | 0.0205/0.0537/0.0243 | 0.0218/0.0570/0.0258 | 0.0259/0.0589/0.0308 |
| Transformer | 0.0154/0.0258/0.0193 | 0.0158/0.0262/0.0204 | 0.0165/0.0295/0.0227 | 0.0172/0.0321/0.0246 |
| Ours | | | | |

Firstly, the Zero method demonstrates the poorest performance among all considered methods. Secondly, KNN and RF consistently outperform the Median and Mean methods at all missing rates, emphasizing their efficacy over imputation methods based on statistical characteristics. Specifically, the RF method exhibits a substantial reduction of 65.3%, 79.2%, and 65.3% in MAE, RMSE, and MRE, respectively, compared to the Median method at a missing rate of 20%. Moreover, the M-RNN method performs worse than both KNN and RF. On the other hand, the BRITS method, which also relies on GRNNs, yields lower MAE and RMSE than KNN for missing rates of 5%, 10%, and 15%. However, at a missing rate of 20%, BRITS shows a higher MAE than KNN but maintains a lower RMSE. These findings imply that BRITS produces smaller deviations than KNN at specific missing entries. Nevertheless, its overall performance remains inferior to that of RF across varying missing rates.

In contrast, irrespective of the missing rate, our proposed model shows lower MAE, RMSE, and MRE values than the others, implying stronger imputation ability for RM-ALT. When the missing rate is 5%, our proposed method yields a 22.3% decrease in MAE, a 45.2% decrease in RMSE, and a 21.7% decrease in MRE relative to RF; when the missing rate is 20%, the respective decreases are 13.8%, 42.9%, and 13.5%. The improvement is most pronounced in RMSE, which indicates that imputing missing values with our method does not produce significant deviations at specific missing points.

Values in each cell are MAE/RMSE/MRE.

| Method | 10% | 20% | 30% |
| --- | --- | --- | --- |
| Zero | 0.8356/1.0005/0.9999 | 0.8367/1.0015/1.0000 | 0.8347/0.9988/1.0000 |
| Mean | 0.0485/0.1225/0.0580 | 0.0484/0.1217/0.0579 | 0.0483/0.1208/0.0578 |
| Median | 0.0472/0.1302/0.0566 | 0.0471/0.1287/0.0564 | 0.0471/0.1282/0.0564 |
| KNN | 0.0222/0.0506/0.0266 | 0.0231/0.0524/0.0276 | 0.0239/0.0557/0.0287 |
| RF | 0.0503/0.1269/0.0601 | 0.0501/0.1260/0.0599 | 0.0501/0.1256/0.0600 |
| M-RNN | 0.0216/0.0574/0.0258 | 0.0226/0.0601/0.0270 | 0.0251/0.0633/0.0301 |
| BRITS | 0.0167/0.0520/0.0199 | 0.0175/0.0550/0.0210 | 0.0169/0.0520/0.0210 |
| Transformer | 0.0134/0.0160/0.0210 | 0.0142/0.0211/0.0170 | 0.0158/0.0232/0.0190 |
| Ours | | | |

| Method | 40% | 50% | 60% |
| --- | --- | --- | --- |
| Zero | 0.8354/0.9996/1.0000 | 0.8356/0.9999/1.0000 | 0.8349/0.9991/1.0000 |
| Mean | 0.0482/0.1212/0.0577 | 0.0484/0.1219/0.0580 | 0.0485/0.1220/0.0580 |
| Median | 0.0471/0.1284/0.0563 | 0.0473/0.1298/0.0566 | 0.0473/0.1298/0.0566 |
| KNN | 0.0247/0.0573/0.0295 | 0.0258/0.0607/0.0309 | 0.0268/0.0642/0.0321 |
| RF | 0.0500/0.1259/0.0599 | 0.0501/0.1269/0.0600 | 0.0502/0.1270/0.0601 |
| M-RNN | 0.0311/0.0703/0.0372 | 0.0320/0.0736/0.0382 | 0.0388/0.0763/0.0464 |
| BRITS | 0.0206/0.0610/0.0247 | 0.0197/0.0596/0.0235 | 0.0215/0.0625/0.0257 |
| Transformer | 0.0174/0.0258/0.0208 | 0.0188/0.0270/0.0225 | 0.0202/0.0314/0.0242 |
| Ours | | | |

Zero imputation yielded the least favorable results of all the models, with MAE, RMSE, and MRE values of 0.8356, 1.0005, and 0.9999, respectively, at a missing rate of 10%, indicating significant data bias. KNN surpasses the Mean and Median methods across all missing rates since it incorporates information from several nearest neighboring samples. The performance of RF is inferior to that of the Mean and Median methods, suggesting that the imputation mechanism of RF may not be appropriate for the RM-ALL setting.

Furthermore, the deep learning models demonstrate remarkable superiority over the other methods at all missing rates. Among the deep learning models, the self-attention-based models produce considerably better outcomes than their GRNN-based counterparts. At a missing rate of 10%, the Transformer shows decreases of 22.2%, 65.0%, and 46.2% in MAE, RMSE, and MRE, respectively, compared to the best GRNN-based method, BRITS. Similarly, at a missing rate of 60%, its MAE, RMSE, and MRE decline by 6%, 49.8%, and 5.8% compared to BRITS. Finally, our proposed model achieves superior results at all missing rates compared to the Transformer, with MAE decreasing by 30.8%, 31.9%, 27.5%, 27.5%, 42.6%, and 44.6% for missing rates of 10%, 20%, 30%, 40%, 50%, and 60%, respectively. Compared to BRITS, the MAE, RMSE, and MRE decrease by 46.1%, 70.2%, and 46.2%, respectively, at a 10% missing rate, and by 47.9%, 72.0%, and 47.9% at a 60% missing rate. It is worth noting that the RMSE values of the BRITS and M-RNN models are significantly higher than their MAE values, which suggests that large errors occur at specific missing data points. In contrast, the MAE and RMSE of our proposed method are much closer, implying that the proposed model seldom produces significant discrepancies at specific missing data points. The case imputation results illustrated in

In the three figures, the imputation results of our proposed model and the BRITS model for three temporally consecutive sliced samples with a missing rate of 40% are presented. The black solid curves represent the ground truth, while the green circles denote the imputed values. It is evident that our proposed model effectively captures the temporal evolution of the data and imputes plausible values for the missing data points. Conversely, the values imputed by BRITS for specific missing data points deviate significantly from the actual values, leading to considerable data bias.

The training parameter counts and computational efficiency of the deep learning imputation models used to achieve the above results are examined. The results are listed in

| Model | Number of training parameters | Training time per epoch |
| --- | --- | --- |
| M-RNN | 2.66 M | 4.240 s |
| BRITS | 2.14 M | 8.674 s |
| Transformer | 1.52 M | 4.377 s |
| Ours | 0.79 M | 1.054 s |

In this section, three ablation studies under the RM-ALL scenario are conducted to validate the design of our proposed model. In

MAE at each missing rate.

| Self-attention mechanism | 10% | 20% | 30% | 40% | 50% | 60% |
| --- | --- | --- | --- | --- | --- | --- |
| Full | 0.0111 | 0.0109 | 0.0112 | 0.0116 | 0.0125 | 0.0131 |
| DMSA | 0.0101 | 0.0106 | 0.0109 | 0.0116 | 0.0114 | 0.0121 |
| PSSA | 0.0094 | 0.0112 | 0.0107 | 0.0111 | 0.0114 | 0.0123 |
| PSDMSA | 0.0090 | 0.0092 | 0.0100 | 0.0103 | 0.0108 | 0.0112 |

The experimental results demonstrate that the model utilizing PSDMSA consistently outperforms the vanilla self-attention mechanism. Across missing rates of 10%, 20%, 30%, 40%, 50%, and 60%, the MAE values exhibit reductions of 19.0%, 15.6%, 10.7%, 11.2%, 13.6%, and 14.5%, respectively, when using PSDMSA. Additionally, our observations reveal that the model’s performance is enhanced when PSSA and DMSA are jointly utilized, surpassing models that rely exclusively on either mechanism. Specifically, when compared to the DMSA-based model, the PSDMSA-based model achieves MAE reduction rates of 10.9%, 13.2%, 8.3%, 11.2%, 5.3%, and 7.4% at missing rates of 10%, 20%, 30%, 40%, 50%, and 60%, respectively. Moreover, in comparison to the PSSA-based model, the PSDMSA-based model demonstrates MAE reductions of 4.4%, 8.9%, 6.5%, 7.2%, 5.3%, and 8.9% at corresponding missing rates of 10%, 20%, 30%, 40%, 50%, and 60%. These findings collectively suggest that PSDMSA significantly enhances imputation performance.

Generally, a deeper network architecture is associated with stronger feature extraction and knowledge-learning capabilities. Consequently, after modifying full self-attention and constructing the first PSDMSA unit, a second unit is incorporated, and the feature vectors learned in the first unit are leveraged for representation reuse. However, the representations derived from the second unit do not consistently surpass those obtained from the first. It is therefore reasonable to combine the learned representations from both units. To enhance imputation quality, a weight fusion unit that adaptively adjusts the weights of the output representations from both PSDMSA units is introduced.

To demonstrate the superior imputation ability of this design, the performance of four models with different combinations of units is compared in this part. The results are presented in

MAE at each missing rate.

| Design | 10% | 20% | 30% | 40% | 50% | 60% |
| --- | --- | --- | --- | --- | --- | --- |
| One unit | 0.0115 | 0.0117 | 0.0121 | 0.0125 | 0.0124 | 0.0132 |
| Two units | 0.0201 | 0.0208 | 0.0215 | 0.0224 | 0.0250 | 0.0285 |
| Two units + residual | 0.0102 | 0.0107 | 0.0111 | 0.0133 | 0.0129 | 0.0139 |
| Ours | 0.0090 | 0.0092 | 0.0100 | 0.0103 | 0.0108 | 0.0112 |

(1) One unit: The second PSDMSA unit and weight fusion unit are not included. The output of only one unit is used as the final representation; (2) Two units: The first and second PSDMSA units are employed without the weight fusion unit, and the output of the second unit is used as the final representation; (3) Two units + residual: The first and second PSDMSA units are employed without the weight fusion unit, and the final representation is derived from the combination of outputs of two units by using a residual connection; (4) Ours.

Based on the experimental findings, the imputation performance of the One-unit model is superior to that of the Two-units model. This implies that increasing the network depth does not necessarily improve performance and may even be detrimental. Moreover, when combining the learned representations from the two units, our model, which incorporates the weight fusion unit, outperforms the model that directly combines the outputs of the two units through a residual connection (Two units + residual). Our model also achieves better performance than the One-unit model. These findings demonstrate a significant enhancement in imputation performance resulting from our design.

To evaluate the suitability of imputation loss (IL) for the self-attention-based imputation model and assess its impact on performance enhancement, we compare the imputation performance of our proposed model using IL, reconstruction loss (RL), and a combination of IL + RL as the learning objectives.
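For concreteness, the sketch below shows one common MAE-based formulation of the two objectives; the exact weighting of the paper's IL + RL variant is not reproduced, and the mask conventions follow the data-processing section (1 marks an observed entry).

```python
import torch

def imputation_loss(x_hat, x_true, m_obs, m_in):
    """IL: error only on artificially masked entries, i.e., entries observed
    in the ground truth (m_obs = 1) but hidden from the model (m_in = 0)."""
    ind = m_obs * (1 - m_in)                  # indicator of artificially masked entries
    return (torch.abs(x_hat - x_true) * ind).sum() / ind.sum().clamp(min=1)

def reconstruction_loss(x_hat, x_true, m_in):
    """RL: error on the entries the model could see (m_in = 1), which a
    non-autoregressive model can trivially copy through its input."""
    return (torch.abs(x_hat - x_true) * m_in).sum() / m_in.sum().clamp(min=1)
```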

MAE at each missing rate.

| Learning objective | 10% | 20% | 30% | 40% | 50% | 60% |
| --- | --- | --- | --- | --- | --- | --- |
| IL | 0.0090 | 0.0092 | 0.0100 | 0.0103 | 0.0108 | 0.0112 |
| RL | 0.4310 | 0.3874 | 0.4873 | 0.4127 | 0.5537 | 0.5763 |
| IL + RL | 0.0119 | 0.0130 | 0.0136 | 0.0148 | 0.0158 | 0.0170 |

As shown in the table, the PSDMSA-based model’s imputation performance is severely impaired when RL is applied. This can be attributed to the non-autoregressive nature of the self-attention mechanism, which processes the input globally and in parallel: the observations can be identified through the missing value mask matrix, so with RL as the loss function the model simply minimizes the discrepancy between observed and reconstructed values while neglecting the missing values. Furthermore, the PSDMSA-based model with IL + RL exhibits lower imputation performance than the model with IL as the learning target. Overall, the model using IL achieves the best imputation performance regardless of the missing rate. For example, at a 60% missing rate, adopting IL as the learning objective decreases MAE by 98.1% and 34.1% compared to the models using RL and IL + RL, respectively. This indicates a substantial improvement in imputation performance due to the use of IL.

This paper proposes a non-autoregressive imputation model that utilizes an improved self-attention mechanism to handle the missing values in RTT-AT. The proposed model consists of two cascaded PSDMSA units incorporating the sparse mechanism, diagonal masking, and a weight fusion unit. The model’s learning objective is to minimize imputation loss.

Initially, two common missing patterns in RTT-AT data are identified and formulated. Next, the differences between the imputed values produced by the proposed model and by the eight other models, on the one hand, and the actual values, on the other, are analyzed in terms of MAE, RMSE, and MRE. During this process, the missing rates are varied to observe how model performance fluctuates. The experimental results in

Additionally, ablation experiments investigate the impact of attention mechanisms, module combinations, and loss functions on imputation performance. The results show that our model achieves outstanding imputation performance, highlighting the effectiveness of its design choices. Integrating PSDMSA, the weight fusion unit, and imputation loss effectively enhances the model’s imputation performance.

However, our model currently focuses on two missing patterns, RM-ALT and RM-ALL; its performance on other types of missingness, such as long-gap missingness, has not been investigated. Furthermore, its performance in the presence of outliers and noise has not yet been explored. In future work, we will examine the model’s imputation capability in extreme scenarios, including long-term missing data and the coexistence of multiple missing patterns. We will also use the current findings to perform experiments on downstream tasks, specifically classification and prediction, to evaluate the plausibility of the imputation outcomes.

The authors are very grateful to the referees and editors for their valuable remarks, which improved the presentation of the paper.

This work was supported by Graduate Funded Project (No. JY2022A017).

The authors confirm their contribution to the paper as follows: conception and design: Zihao Song, Yan Zhou, and Futai Liang; manuscript preparation: Zihao Song, Yan Zhou, and Wei Cheng; software: Zihao Song, and Chenhao Zhang; supervision: Yan Zhou, and Wei Cheng. All authors reviewed the results and approved the final version of the manuscript.

The data used or analyzed during the current study are available from the corresponding author on reasonable request.

The authors declare that they have no conflicts of interest to report regarding the present study.