Aiming at the problem of abnormal data generated by a power transformer on-line monitoring system due to the influences of transformer operation state change, external environmental interference, communication interruption, and other factors, a method of anomaly recognition and differentiation for monitoring data was proposed. Firstly, the empirical wavelet transform (EWT) and the autoregressive integrated moving average (ARIMA) model were used for time series modelling of monitoring data to obtain the residual sequence reflecting the anomaly monitoring data value, and then the isolation forest algorithm was used to identify the abnormal information, and the monitoring sequence was segmented according to the recognition results. Secondly, the segmented sequence was symbolised by the improved multi-dimensional SAX vector representation method, and the assessment of the anomaly pattern was made by calculating the similarity score of the adjacent symbol vectors, and the monitoring sequence correlation was further used to verify the assessment. Finally, the case study result shows that the proposed method can reliably recognise abnormal data and accurately distinguish between invalid and valid anomaly patterns.

With the extensive application of advanced sensor monitoring technology in the operation and maintenance of power transformers, the scale of transformer monitoring data shows an exponential trend in its growth, providing an important data foundation for the comprehensive state assessment and prediction of equipment; however, affected by various emergencies, an equipment on-line monitoring system will inevitably generate some abnormal data. According to the operating characteristics of the transformer, the abnormalities in the monitoring data are mainly divided into invalid abnormal data and valid abnormal data.

Invalid abnormal data include missing values and noise. Missing values refer to data interruptions caused by short-term failure of the sensing device, abnormal communication ports, and recording errors. Noise refers to data that deviate from the expected value due to factors such as unstable operation of monitoring equipment and external environmental interference. Data cleaning measures should be taken against invalid abnormal data to ensure the smooth progress of subsequent equipment condition assessment. Valid abnormal data refer to the level shifts and trend changes in the monitoring data caused by sudden failures and insulation degradation during operation of the equipment. Valid abnormal data contain key information about abnormal changes in the equipment state and reflect the true operating state of the equipment, so they fall outside the scope of data cleaning.

Reliable identification of abnormal data and effective differentiation of abnormal patterns are important foundations for achieving on-line monitoring data cleaning and an accurate understanding of equipment operating status. In the anomaly data identification process, time series decomposition [

To obtain an accurate understanding of changes in the status of a power transformer and to prevent key information about any abnormal state from being mistakenly cleaned, an in-depth analysis of pattern differentiation is necessary; however, few related research results exist. Reference [

The above-mentioned papers on abnormal data identification and pattern distinction focus mainly on abnormal points and do not take into account the differences in the overall characteristics of the sequence before and after the abnormal time under different abnormal patterns, which may easily lead to misjudgment of abnormal patterns. In the present work, a complete anomaly detection framework was constructed for transformer monitoring data that contain different types of anomaly data. First, the empirical wavelet transform (EWT) and the autoregressive integrated moving average (ARIMA) model were used to model the monitoring data time series, and the residual sequence reflecting the abnormal characteristics of the monitoring data was obtained by calculating the difference between the predicted value and the measured value. Then, the isolation forest algorithm was used to recognise anomalies in the residual sequence, and the recognition results were used as segment boundaries to segment the original sequence. Finally, the improved multi-dimensional SAX vector representation was used to represent each segmented subsequence as a symbol vector. The similarity score of two adjacent symbol vectors was calculated and combined with the decision threshold to realise anomaly pattern discrimination, and the assessment results were verified by using the sequence correlation. The effectiveness of this method was verified using oil temperature and dissolved gas-in-oil monitoring data from a 500-kV transformer. The research results can provide key technical support for the efficient cleaning of equipment monitoring data and accurate diagnosis of its operating status.

To identify anomalies in transformer monitoring data, the EWT is first used to adaptively decompose the original sequence into time series components with different frequencies. Secondly, the time series components are modelled by ARIMA, and the predicted values of each component are reconstructed to obtain the predicted values of the monitoring sequence. On this basis, the difference between the predicted value and the measured value is calculated to obtain the residual sequence, in which the abnormal data features are clearly characterised. Finally, the isolation forest algorithm is applied to the residual sequence to extract the abnormal information in the monitoring sequence.

The EWT is a signal adaptive analysis method [

Through the Fourier transform, the Fourier spectrum

The Fourier spectrum of the signal is adaptively divided into _{n}

The detailed coefficients and approximate coefficients were obtained through the following operations:

Reconstruct the original signal according to _{0}_{k}

Taking a 500-kV main transformer as an example, the top-layer oil temperature monitoring sequence with a length of 426 is processed by EWT to obtain four sets of modal components. The specific decomposition results are shown in

The ARIMA model is usually referred to as ARIMA (

Firstly, the stationarity test of input time series is needed to determine the value of difference order. In the present work, the construction test statistics were selected for hypothesis testing to determine the stationarity of the input time series. For non-stationary time series, it is necessary to repeat the difference process until the processed time series is stabilised. The difference processing process for a non-stationary time series {_{t}

The non-stationary time series {_{t}_{t}^{th} autoregressive term; ^{th} moving average; {

The construction of the ARMA model includes model ordering and parameter estimation. In the present work, the maximum likelihood method is used to estimate the parameters of the model. Based on the Akaike information criterion (AIC) [

The monitoring sequence of the transformer is decomposed by EWT, and an ARIMA prediction model is constructed for each modal component obtained by the decomposition, following the aforementioned steps. To ensure the prediction accuracy of the ARIMA model, the component values are predicted one step ahead. By sliding the fitting window and the prediction window to the right with time, the complete prediction sequence of each modal component is obtained; the predicted results of all components are then reconstructed to obtain a complete prediction sequence of the monitoring data.
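The sliding-window, one-step-ahead scheme and the residual calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `drift_predictor` is a hypothetical stand-in for an ARIMA fit on each window, reconstruction is assumed to be a plain sum of the component predictions, and the residual is taken as measured minus predicted.

```python
def drift_predictor(window):
    """Stand-in one-step predictor (NOT ARIMA): last value plus the mean
    step inside the window. An ARIMA fit on `window` would be plugged in
    here. Requires at least two samples in the window."""
    steps = [b - a for a, b in zip(window, window[1:])]
    return window[-1] + sum(steps) / len(steps)


def rolling_one_step(component, window, predict_next=drift_predictor):
    """Slide the fitting window to the right and predict one step ahead
    at every position, starting at index `window`."""
    return [predict_next(component[t - window:t])
            for t in range(window, len(component))]


def residuals(measured, components, window, predict_next=drift_predictor):
    """Sum the per-component predictions (assumed reconstruction rule) and
    return measured minus predicted for each predicted time index."""
    per_component = [rolling_one_step(c, window, predict_next)
                     for c in components]
    predicted = [sum(values) for values in zip(*per_component)]
    return [m - p for m, p in zip(measured[window:], predicted)]


# Illustrative data: two hand-made "modal components" and a spike at index 15.
trend = [0.5 * t for t in range(20)]
offset = [1.0] * 20
measured = [a + b for a, b in zip(trend, offset)]
measured[15] += 5.0
res = residuals(measured, [trend, offset], window=4)
```

Because the first `window` samples are only used for fitting, entry `i` of `res` corresponds to sampling point `i + window` of the monitoring sequence.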

EWT and ARIMA models were used to obtain the predicted values of monitoring indicators, which are subtracted from the actual measured values to obtain the residual items at the corresponding time, as given by

The isolation forest (iForest) algorithm is an unsupervised anomaly detection method for continuous data [

The isolation forest is composed of multiple isolation trees (iTrees). Each iTree is constructed as follows:

1. Randomly select n training data points as a sub-sample set and put it into the root node of the tree.

2. Specify an attribute dimension randomly, and randomly generate a cutting point s between the maximum and minimum values of the attribute dimension.

3. Use this cutting point to generate a hyperplane that divides the data space of the current node into two sub-sample spaces; put data less than s into the left branch of the current node, and put data greater than or equal to s into the right branch of the current node.

4. Repeat Steps 2 and 3 to construct new subspace nodes until the data cannot be split further or the depth limit of the isolation tree is reached.
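The construction steps above can be sketched for a one-dimensional residual sequence, where the random choice of attribute is trivial. This is a minimal sketch under stated assumptions: the score `2 ** (-E[h(x)] / c(n))` and the normalising term `c(n)` follow the standard isolation forest formulation, and the tree count, sub-sample size, and depth limit are illustrative defaults rather than values from this paper.

```python
import math
import random


def build_itree(points, depth, max_depth, rng):
    """Split the points at a random cut until they are isolated, identical,
    or the depth limit is reached."""
    if depth >= max_depth or len(points) <= 1 or min(points) == max(points):
        return {"size": len(points)}
    s = rng.uniform(min(points), max(points))      # random cutting point
    return {
        "cut": s,
        "left": build_itree([x for x in points if x < s], depth + 1, max_depth, rng),
        "right": build_itree([x for x in points if x >= s], depth + 1, max_depth, rng),
    }


def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup
    over n points; used to normalise path lengths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0                                 # exact value for n = 2
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def path_length(x, node, depth=0):
    """Depth at which x lands in a leaf, plus c(size) for unsplit leaves."""
    if "cut" not in node:
        return depth + c(node["size"])
    child = node["left"] if x < node["cut"] else node["right"]
    return path_length(x, child, depth + 1)


def anomaly_scores(data, n_trees=100, sample_size=64, seed=0):
    """Score each value in `data` (at least two values); scores near 1
    indicate short average paths, i.e. likely anomalies."""
    rng = random.Random(seed)
    n = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(n))
    trees = [build_itree(rng.sample(data, n), 0, max_depth, rng)
             for _ in range(n_trees)]
    return [2.0 ** (-sum(path_length(x, t) for t in trees) / (n_trees * c(n)))
            for x in data]


# Illustrative residuals: 50 small values and one large deviation at the end.
demo = [0.01 * i for i in range(50)] + [10.0]
scores = anomaly_scores(demo)
```

Points that are isolated after only a few cuts have short paths and therefore high scores, which is why an isolated large residual stands out from the bulk of the sequence.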

An isolation forest with multiple isolation trees was thus established so that abnormal data can be detected based on the path length

When

After the abnormal data information has been extracted, the abnormal patterns can be determined. Invalid anomaly patterns mainly include noise and missing values: at the moment when such an anomaly occurs, the observed value deviates significantly from the expected value, but the time series before and after this moment retain relatively consistent characteristics. Valid anomaly patterns mainly refer to the horizontal migration and trend change of monitoring data caused by abnormal changes in equipment status, so the time series characteristics before and after the abnormality differ significantly. Therefore, after dividing the time sequence by using the abnormal points as segment points, an improved multi-dimensional SAX vector representation method is introduced to represent each segmented sub-sequence as a multi-dimensional symbolic vector. The similarity score of two adjacent symbol vectors is then calculated and combined with the decision threshold to distinguish different abnormal patterns, and sequence correlation analysis is further used to verify the results of pattern determination.

Symbolic aggregation approximation (SAX) is commonly used for symbolic representation of time series [

Z-score standardisation of the time series

Z-score standardisation can transform data of different orders of magnitude into the value of unified measurement to ensure comparability of data.

Segment the time series equidistantly and express their eigenvalues

The normalised time series are divided into equidistant segments, and mean value, slope, and sample entropy are selected as the eigenvalues of the time series to construct the eigenvalue vector that can fully characterise the time series.
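The first two steps can be sketched as below. This is a minimal sketch under stated assumptions: the slope is taken as the least-squares slope of each segment against its sample index, the population standard deviation is used for the Z-score, and sample entropy is omitted for brevity, so each feature vector here has only two of the three eigenvalues.

```python
from statistics import fmean, pstdev


def zscore(series):
    """Z-score standardisation (assumes the series is not constant)."""
    mu, sigma = fmean(series), pstdev(series)
    return [(x - mu) / sigma for x in series]


def ls_slope(segment):
    """Least-squares slope of a segment against its sample index
    (needs at least two samples)."""
    n = len(segment)
    t_mean, y_mean = (n - 1) / 2, fmean(segment)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(segment))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def segment_features(series, n_segments):
    """Standardise, cut into equal-length segments, and return a
    (mean, slope) pair for each segment."""
    z = zscore(series)
    size = len(z) // n_segments        # trailing remainder samples are dropped
    segments = [z[i * size:(i + 1) * size] for i in range(n_segments)]
    return [(fmean(s), ls_slope(s)) for s in segments]


# Illustrative input: a short linear ramp cut into three segments.
features = segment_features([float(i) for i in range(12)], 3)
```

For a linear ramp every segment has the same slope and the segment means increase monotonically, which is a quick sanity check on the feature extraction.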

Symbolic vector representation of time series

According to the numerical distribution of time series eigenvalues, the numerical space between each type of eigenvalue is divided with equal probability, and different characters are used to represent the divided numerical subspace regions, such as the letter set {A, B, C, D, E, …}. Let the scale parameter of the set be

Taking 60 sets of top oil temperature monitoring data of a 500-kV main transformer as an example, the process of multi-dimensional symbolic vector representation is realised. First, the monitoring sequence is normalised by Z-score (

Then, the normalised monitoring sequence is divided into 10 segments at equal intervals, and the mean, slope, and sample entropy of each segmented sequence are calculated (

Fragment | Mean | Slope | Sample entropy |
---|---|---|---|
1 | 0.110 | 0.453 | 0.470 |
2 | 0.781 | 0.371 | 0.368 |
3 | 0.334 | 0.357 | 0.214 |
4 | −0.049 | 0.305 | 0.470 |
5 | 0.445 | 0.253 | 0.894 |
6 | 0.198 | 0.104 | 0.134 |
7 | −0.129 | 0.237 | 0.693 |
8 | −0.393 | 0.474 | 0.470 |
9 | −0.313 | 0.298 | 0.214 |
10 | −0.984 | −0.068 | 0.080 |

Furthermore, the numerical space of each eigenvalue is divided into equiprobable subspaces with α set to 20, and the subspaces are represented by the letters “A” to “T” from bottom to top.
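The equiprobable division with α = 20 can be sketched as follows. The way the breakpoints are derived is an assumption: a normal distribution is fitted to the values of one eigenvalue and its quantiles are used as breakpoints, in the spirit of classical SAX, which may differ from the division rule used in the paper. The input values are the Mean column of the segment table.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev
import string


def equiprobable_breakpoints(values, alpha):
    """alpha - 1 cut points that split a fitted normal distribution into
    alpha regions of equal probability (values must not all be equal)."""
    dist = NormalDist(fmean(values), pstdev(values))
    return [dist.inv_cdf(k / alpha) for k in range(1, alpha)]


def symbolise(values, alpha=20):
    """Map each value to a letter, 'A' for the lowest region upwards."""
    letters = string.ascii_uppercase[:alpha]
    cuts = equiprobable_breakpoints(values, alpha)
    return [letters[bisect_right(cuts, v)] for v in values]


# Mean column of the ten oil-temperature segments.
means = [0.110, 0.781, 0.334, -0.049, 0.445, 0.198, -0.129, -0.393, -0.313, -0.984]
mean_symbols = symbolise(means, alpha=20)
```

Applying the same function to the slope and sample entropy columns and zipping the three results gives one three-letter symbol vector per segment.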

Finally, the symbolic representation of the oil temperature sequence eigenvalues is obtained (

In summary, a three-dimensional real vector space was constructed by improving the SAX vector representation method. The three dimensions in the space represent the three eigenvalues of mean, slope, and sample entropy, respectively. Therefore, the characteristics of each sub-segment of the time series can be represented by a symbol vector in three-dimensional space, for example, ^{th} sub-segment of the time series.

Through the aforementioned steps, the multi-dimensional symbolic vector representation of each segmented subsequence is realised. When the abnormal point belongs to a valid abnormal pattern, the characteristics of the sub-sequences on the left and right sides of the abnormal point differ greatly; when the abnormal point belongs to an invalid abnormal pattern, the sub-sequences on the left and right sides of the abnormal point maintain more consistent characteristics. Therefore, the anomaly pattern is determined by calculating the similarity of the symbol vectors of the sub-sequences on both sides of the abnormal point. The specific process is as follows:

For a segment boundary, the lengths of the multi-dimensional symbolised vectors of the sub-sequences on both sides are compared. Then take the multi-dimensional symbolic vector sequence (

Move the target template sequence (

where,

Set the threshold

Repeat the above steps until the abnormal patterns of all abnormal points in the monitoring sequence are determined.

Based on the example given here, several preliminary similarity retrieval experiments showed that the similarity score of two groups of sequences with relatively consistent patterns remains stably below 0.5; therefore, the pattern decision threshold was set to 0.5. However, considering the limitations of threshold setting, sequence correlation analysis was introduced to verify the results of threshold-based pattern differentiation.
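The template-matching idea can be sketched as follows. The distance measure is not the paper's: `vector_gap` is a hypothetical stand-in that averages the gap between letter positions. Choosing the shorter side as the template and taking the minimum over all alignments are also assumptions. Because the stand-in score lives on a different scale, the threshold of 0.5 does not carry over and must be re-calibrated for whatever distance is actually used.

```python
def vector_gap(u, v):
    """Hypothetical distance between two symbol vectors: mean absolute
    difference between the alphabet positions of their letters."""
    return sum(abs(ord(a) - ord(b)) for a, b in zip(u, v)) / len(u)


def similarity_score(left, right):
    """Slide the shorter symbol-vector sequence along the longer one and
    return the smallest mean gap found (0 means an exact match exists)."""
    template, target = (left, right) if len(left) <= len(right) else (right, left)
    m = len(template)
    return min(
        sum(vector_gap(u, v) for u, v in zip(template, target[i:i + m])) / m
        for i in range(len(target) - m + 1)
    )


def is_valid_anomaly(left, right, threshold):
    """Scores above the threshold mean the two sides differ markedly."""
    return similarity_score(left, right) > threshold


# Illustrative symbol vectors (mean, slope, sample entropy letters).
left = [("A", "B", "C"), ("B", "B", "C")]
right_similar = [("T", "T", "T"), ("A", "B", "C"), ("B", "B", "C"), ("S", "S", "S")]
right_different = [("T", "T", "T")] * 3
```

A low score means that the pattern on one side of the abnormal point reappears on the other side, which is the signature of an invalid anomaly such as noise or a missing value.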

The commonly used methods for time series correlation analysis include the Apriori algorithm [

The grey correlation analysis algorithm judges the strength of the correlation between parameters according to the similarity of the geometric shapes of the parameters. Through quantitative analysis of the development trend of the dynamic process, the algorithm completes the comparison of the geometric relationships between the time series and calculates the degree of correlation between the parameters.
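A grey correlation calculation of this kind can be sketched as follows. It uses the widely used Deng formulation, with the coefficient `(d_min + rho * d_max) / (d + rho * d_max)` and a resolution coefficient `rho = 0.5`. The mean-value normalisation and the choice of `rho` are assumptions and may differ from the normalisation used in this paper.

```python
from statistics import fmean


def mean_normalise(series):
    """Divide by the series mean so that sequences with different units
    become comparable (assumes a non-zero mean)."""
    mu = fmean(series)
    return [x / mu for x in series]


def grey_relational_degrees(reference, comparisons, rho=0.5):
    """Return one grey correlation degree per comparison sequence."""
    ref = mean_normalise(reference)
    deltas = [[abs(r - c) for r, c in zip(ref, mean_normalise(comp))]
              for comp in comparisons]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0.0:                    # every comparison equals the reference
        return [1.0] * len(comparisons)
    return [fmean([(d_min + rho * d_max) / (d + rho * d_max) for d in row])
            for row in deltas]


# Illustrative data: one identical and one reversed comparison sequence.
degrees = grey_relational_degrees([1.0, 2.0, 3.0, 4.0],
                                  [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]])
```

A comparison sequence whose shape follows the reference closely receives a degree near 1, while a sequence that moves in the opposite direction receives a clearly lower value.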

Here, the reference sequence is denoted by

On this basis, the correlation coefficients of the corresponding elements of the comparison series and the reference series are calculated:

According to the grey correlation coefficients at each time point, the grey correlation degree between the reference sequence and the ^{th} comparison sequence can be obtained:

The greater the value of _{i}_{m}

An abnormal detection technology framework for transformer monitoring data was constructed: it includes abnormal recognition and pattern determination function modules, as shown in

The grey correlation analysis algorithm is used to measure the correlation between the sequence to be detected and other monitoring sequences. If there is a correlation sequence, the verification link is retained in the process of determining its abnormal pattern; if there is no correlation sequence, the verification link is removed.

EWT theory is used to decompose the monitoring sequence, and the ARIMA prediction models are established for the modal components obtained from the decomposition. On this basis, the predicted results for each component are reconstructed to obtain the prediction sequence pertaining to the monitoring-data sequence.

The residual sequence is obtained by calculating the difference between the predicted value and the actual value and the iForest algorithm is used to identify abnormal points in the residual sequence. These abnormal points are then used to segment the original monitoring sequence.

The improved multi-dimensional SAX vector representation method is used to multi-dimensionally symbolise the segmented sequence and calculate the similarity scores of the symbol vectors on both sides of each abnormal point, so that different abnormal patterns are distinguished by combining the decision threshold.

From the perspective of ensuring the safe and stable operation of the equipment, when an abnormal point of the monitoring sequence is determined to be an invalid abnormal pattern, the determination must be verified against the correlation sequence. If the correlation sequence contains no abnormal point at the same or an adjacent time, the abnormal point is confirmed as an invalid abnormal pattern. If the correlation sequence does contain an abnormal point at the same or an adjacent time, the abnormal point is classified as a valid abnormal pattern: the abnormality may be due to an abnormal change in the operating state of the power transformer, or the related monitoring quantities may have been simultaneously subject to interference from external factors during measurement or transmission, which requires further intervention and judgment by staff.
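The decision and verification logic can be sketched as a small function. The function name, the return labels, and the `tolerance` of one sampling point for "adjacent time" are hypothetical choices made for illustration. Treating a score exactly equal to the threshold as invalid is also an assumption.

```python
def judge_anomaly(score, threshold, index, correlated_anomalies, tolerance=1):
    """Classify one abnormal point.

    score                -- similarity score at the abnormal point
    threshold            -- decision threshold (0.5 in the case study)
    index                -- sampling index of the abnormal point
    correlated_anomalies -- one list of abnormal indices per correlated
                            sequence; empty if no correlated sequence exists,
                            in which case the verification step is skipped
    tolerance            -- how many samples away still counts as adjacent
    """
    if score > threshold:
        return "valid"
    nearby = any(abs(index - j) <= tolerance
                 for sequence in correlated_anomalies
                 for j in sequence)
    # A coincident anomaly in a correlated sequence overrides the threshold
    # decision and flags the point for review by staff.
    return "valid (needs review)" if nearby else "invalid"


# Scores taken from the C2H4 case study; the correlated CH4 sequence is
# abnormal at point 245 only.
result_245 = judge_anomaly(1.317, 0.5, 245, [[245]])
result_380 = judge_anomaly(0.257, 0.5, 380, [[245]])
```

Keeping the verification step separate from the threshold comparison makes it easy to drop when no correlated sequence is available, by passing an empty list.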

To verify the effectiveness of the proposed anomaly detection method, the on-line monitoring data collected from a 500-kV transformer in a substation from 8 November 2017 to 17 January 2018 are used as an example for analysis. The on-line monitoring data include seven parameters such as H_{2} concentration, top layer oil temperature, and load. The sampling interval of each monitoring parameter is 4 h, so each time series contains 426 data points.

Taking the top layer oil temperature monitoring sequence as an example for analysis, according to the detection process described above, the grey correlation algorithm is used to calculate the correlation between it and other monitoring sequences. According to the calculated results (_{m}

Monitoring indicator | Grey correlation degree |
---|---|
load | 0.730 |
H_{2} | 0.624 |
CH_{4} | 0.672 |
C_{2}H_{6} | 0.698 |
C_{2}H_{4} | 0.690 |
C_{2}H_{2} | 0.608 |
CO | 0.600 |
CO_{2} | 0.629 |

The oil temperature sequence to be detected is shown in ^{th} sampling point to 0 as a missing value and added noise at the 172^{nd} and 412^{th} sampling points. The processed sequence is shown in

The oil temperature monitoring sequence after adding the abnormal point is modelled by EWT and ARIMA time series to obtain the residual sequence that can reflect the abnormal characteristics of the monitoring data (^{nd}, 230^{th}, and 412^{th} sampling points of the monitoring sequence, which are consistent with the position of the abnormal points set here.

The sequence is segmented based on the aforementioned abnormal point position information, and each segment is expressed as a symbol vector using the improved multi-dimensional SAX vector representation method. Then, the similarity score of symbol vectors on both sides of each abnormal point is calculated. According to the calculated results in

Abnormal point | Similarity score | Judgement result |
---|---|---|
172 | 0.269 | invalid abnormality |
230 | 0.203 | invalid abnormality |
412 | 0 | invalid abnormality |

Taking the data from the monitoring of C_{2}H_{4} gas concentration as an example for analysis, the calculated grey correlations between C_{2}H_{4} and other monitoring sequences are listed in _{2}H_{4} concentration sequences include H_{2}, CH_{4}, C_{2}H_{6}, and C_{2}H_{2} concentration sequences. Among them, the correlation between CH_{4} and C_{2}H_{4} concentrations is the highest, and the on-line monitoring data of both are shown in

Monitoring indicator | Grey correlation degree |
---|---|
load | 0.702 |
H_{2} | 0.793 |
CH_{4} | 0.885 |
C_{2}H_{6} | 0.835 |
oil temperature | 0.690 |
C_{2}H_{2} | 0.833 |
CO | 0.733 |
CO_{2} | 0.538 |

Using EWT theory and the ARIMA model to conduct time sequence modelling on the C_{2}H_{4} concentration monitoring data, a residual sequence that can reflect the abnormal characteristics of the monitoring data is obtained (^{th} and 380^{th} sampling points of the monitoring sequence. On this basis, the above abnormal points are used as the segment points of the sequence and, on the basis of the improved multi-dimensional SAX vector representation method to complete the determination of the pattern of the abnormal points, the results are as listed in

According to ^{th} sampling point of the C_{2}H_{4} concentration sequence is 0.257, indicating that the time sequence characteristics before and after the abnormal moment are consistent, and, combined with the decision threshold, it is determined that it is an invalid abnormality pattern. In the process of result verification, the correlation sequence such as CH_{4} has no abnormal value at the same, or adjacent times, therefore, it can be concluded that there are noise data caused by factors such as external environmental interference or an unstable sensor device thereat.

Abnormal point | Similarity score | Judgement result |
---|---|---|
245 | 1.317 | valid abnormality |
380 | 0.257 | invalid abnormality |

The similarity score at the 245^{th} sampling point of the C_{2}H_{4} concentration sequence is 1.317, indicating that the time series characteristics before and after the abnormal moment have changed to a significant extent, and, combined with the decision threshold, it is determined that it belongs to a valid abnormal mode. In addition, the associated CH_{4} concentration monitoring sequence also shows abnormal values at the same time, and the credibility of the above conclusion is further guaranteed. The actual situation at that time is that there are loose, hot parts inside the body of the power transformer, and the contact state is unstable, resulting in unstable staged high-temperature gas production. The results of the abnormal mode determination herein are consistent with the actual situation.

Finally, for the above two examples, the time needed to calculate similarity on the symbol-vector representation of each sub-segment (Method 1) is compared with the time needed for direct numerical calculation on the segmented data (Method 2). In Case 1 and Case 2, the calculation time of Method 1 is 10.55% and 5.27% of that of Method 2, respectively, which shows that the improved multi-dimensional SAX vector representation method can significantly improve computational efficiency. The comparison of these results is shown in

Case | Method 1 calculation time (s) | Method 2 calculation time (s) |
---|---|---|
1 | 7.57 | 71.78 |
2 | 9.30 | 176.45 |

Aiming at the problem of abnormal data generated by abnormal changes in equipment status, external environmental interference and communication interruption in the power transformer on-line monitoring system, a method for abnormal detection and pattern differentiation of monitoring data was proposed. The following conclusions are drawn:

By combining EWT with the ARIMA model, the on-line monitoring data were modelled to obtain a residual sequence that reflects the abnormal characteristics of the monitoring data, and the iForest algorithm was then used to extract the abnormal information in the residual sequence efficiently.

Based on an in-depth analysis of the difference between invalid abnormal data and valid abnormal data, the improved multi-dimensional SAX vector representation method was introduced to symbolise the time series. The similarity score of the symbol vectors is then used to measure the difference in the characteristics of the segmented sequences on both sides of the abnormal point and, combined with the decision threshold, realises the effective differentiation of abnormal patterns.

The grey correlation analysis algorithm was used to measure the correlation between monitoring sequences, and the results of an abnormal pattern assessment were further verified on the basis of the correlation between the time series, thus avoiding the limitations of decision threshold setting.

In summary, the proposed anomaly detection technology framework can provide key technical support for efficient cleaning of power transformer on-line monitoring data and accurate assessment of equipment operating status.