With the rapid development of the economy, the power grid keeps expanding. The grid now comprises a very large number of power equipment units, so the volume of equipment state data is growing explosively. These multi-source heterogeneous data differ in form, which causes data to be altered during transmission and storage and produces incomplete data. Research on data integrity has therefore become an urgent task. This paper builds on the randomness and the spatio-temporal differences of the system. According to the characteristics and sources of the massive data generated by power equipment, a fuzzy mining model of power equipment data is established, and the data is divided into numerical and non-numerical data. The text data of power equipment defects is taken as the mining material. An array-based Apriori algorithm is then used for deep mining, and the strong association rules in the incomplete data of power equipment are obtained and analyzed. Judging from the trends of the NRMSE metric and the classification accuracy, most filling methods combined with the two frameworks in this method show a relatively stable filling trend and do not fluctuate greatly as the missing rate grows. The experimental results show that the proposed model effectively improves the filling effect of existing filling methods on most data sets. The size of the improvement changes with the missing rate: as the missing rate increases, the improvement over the existing filling methods exceeds 4.3%. The incomplete-data clustering technique studied in this paper supports a new approach to assessing the reliable operation of the smart grid and offers research value and reference significance.

With social progress and economic development, the power grid is growing rapidly and expanding continuously. Grid voltage levels keep rising, and UHV (ultra-high voltage), large-capacity, long-distance transmission has become a reality. The reliable and safe operation of the power grid bears on economic development and people’s livelihood, and power grid enterprises are investing more and more in operation and maintenance equipment. The Ubiquitous Power Internet of Things aims to cover all aspects of energy production, transmission and consumption, to establish a “second network” on top of the existing power network, and to form the energy Internet together with the smart grid; through its construction, SGCC (State Grid Corporation of China) is to be transformed into a world-class energy Internet enterprise. The energy Internet is characterized by holographic perception of power grid status, comprehensive connection of operation data, online company business, a new customer service experience and open sharing of the energy ecology [

The power grid should achieve comprehensive state perception, efficient information processing, and convenient, flexible application.

Gholami et al. [

Incomplete data of power equipment is typical semi-structured data, including equipment model, voltage level, manufacturer, discovery time and nature, so it can be used as structured data [

The main innovations of this paper are:

A power equipment data structure is constructed. A fuzzy-mathematics state model is adopted for numerical data, an assessment model based on the power equipment state assessment guideline is adopted for non-numerical data, and the two kinds of data are finally fused.

A clustering method is proposed that treats power equipment data according to its data type.

An array-based Apriori algorithm is proposed to mine incomplete structured data.

The overall process of the research framework in this paper is shown in

Fuzzy clustering is used to integrate multi-source heterogeneous data into a unified system, and the initial weight phasor of an element in the index layer relative to the middle layer is obtained through calculation [

To establish the associated index set of power equipment, the membership function can be used to solve the problem of the degree to which an element belongs to the boundary fuzzy set, and the value range is between 0 and 1 [

Association rules can be expressed by the implication

The things to be associated in the association rule are called items, which are the components of the itemset I. The length of the itemset is used to represent the number of items in the itemset. A k-itemset is an itemset that contains k items. The sample set Y to be mined is defined as a subset of the item set, and the sample database D includes all samples [

The support degree of the association rule

The confidence of an association rule

In addition, the minimum support
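For reference, the standard textbook definitions of these quantities can be written as follows. The formulas of the paper itself are not reproduced in this text, so the notation is an assumption: A and B are disjoint itemsets (antecedent and consequent), T is a sample in the database D, and s_min and c_min denote the minimum support and minimum confidence.

```latex
\begin{align}
&A \Rightarrow B, \qquad A \subset I,\; B \subset I,\; A \cap B = \varnothing \\
&\operatorname{support}(Z) = \frac{\lvert \{\, T \in D : Z \subseteq T \,\} \rvert}{\lvert D \rvert}
  \quad \text{for any itemset } Z \subseteq I \\
&\operatorname{support}(A \Rightarrow B) = \operatorname{support}(A \cup B) \\
&\operatorname{confidence}(A \Rightarrow B) = \frac{\operatorname{support}(A \cup B)}{\operatorname{support}(A)}
\end{align}
```

Under these definitions, a rule is called strong when its support is at least s_min and its confidence is at least c_min.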

The mining of Apriori algorithm is divided into two stages:

In sample data, an iteration method is adopted to calculate and retrieve the itemsets of which the support degree is not less than the minimum support degree to obtain frequent itemsets.

Select the frequent itemsets with confidence not less than the minimum confidence to obtain the strong association rules [

The operation process of the Apriori algorithm in the execution of the above two stages is to first calculate the support, retain the itemsets that are greater than or equal to the minimum support (meet the minimum support), that is, to obtain “order 1 frequent itemsets

To reduce the number of times the program scans the database space and to speed up the program processing, two important properties need to be used in the algorithm:

If an itemset

If the itemset
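The two stages and the pruning idea can be sketched in Python as follows. This is a minimal illustration of the classic Apriori procedure, not the authors' implementation; the function names and the toy transactions are hypothetical. The pruning step relies on the well-known property that every subset of a frequent itemset is itself frequent.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Stage 1: return {itemset: support} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


def strong_rules(frequent, min_confidence):
    """Stage 2: return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                a = frozenset(antecedent)
                confidence = s / frequent[a]  # subsets of frequent sets are frequent
                if confidence >= min_confidence:
                    rules.append((a, itemset - a, s, confidence))
    return rules


# Hypothetical toy transactions, for illustration only.
toy = [{"casing", "crack"}, {"casing", "crack"}, {"casing", "rust"}, {"cooling", "leak"}]
freq = apriori(toy, min_support=0.5)
rules = strong_rules(freq, min_confidence=0.9)
```

In this toy run the only strong rule is "crack implies casing", because every transaction with a crack also mentions the casing, while the reverse direction has a confidence of only about 0.67.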

Incomplete data includes incomplete key information of power equipment such as manufacturer, equipment type, equipment components and voltage level, as well as peripheral key information such as discovery time, service location and nature. To ensure convenient and accurate reading, these unstructured data are structured [

Variables | Device components |
---|---|

A4_1 | Ontology |

A4_2 | Casing |

A4_3 | Cooling system |

A4_4 | Tap changer |

A4_5 | Non-electric quantity protection |

Select three voltage levels and number them A5_1-A5_3, as shown in

Variables | Voltage grade |
---|---|

A5_1 | 66 kV |

A5_2 | 220 kV |

A5_3 | 500 kV |

The time of discovery and the place of service are integrated as an important basis for the prediction of the operating environment of power equipment. Among them, the discovery time is classified by month, and there are 12 categories that can be structured, which are A6_1-A6_12. The location is divided into 14 categories according to the source administrative region of the incomplete data of the current analysis of power equipment, which are A7_1-A7_14. The final index nature is divided into 3 categories according to the Classification Standard for Primary Power Transmission and Transformation Equipment [

Variables | Nature of the defect |
---|---|

A8_1 | General |

A8_2 | Serious |

A8_3 | Critical |
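As an illustration of this structuring step, the sketch below maps one defect record to the identifiers listed in the tables above. The record field names (`component`, `voltage`, `month`, `nature`) are hypothetical, and the mapping of calendar month k to `A6_k` is an assumption. Fields that are missing or unknown are skipped, which is how an incomplete record ends up with fewer items.

```python
# Identifier tables copied from the component, voltage and defect-nature tables.
COMPONENTS = {
    "Ontology": "A4_1",
    "Casing": "A4_2",
    "Cooling system": "A4_3",
    "Tap changer": "A4_4",
    "Non-electric quantity protection": "A4_5",
}
VOLTAGES = {"66 kV": "A5_1", "220 kV": "A5_2", "500 kV": "A5_3"}
NATURES = {"General": "A8_1", "Serious": "A8_2", "Critical": "A8_3"}


def encode_record(record):
    """Return the structured identifiers of a defect record (a dict).

    Missing or unrecognized fields are skipped rather than guessed.
    """
    codes = []
    if record.get("component") in COMPONENTS:
        codes.append(COMPONENTS[record["component"]])
    if record.get("voltage") in VOLTAGES:
        codes.append(VOLTAGES[record["voltage"]])
    month = record.get("month")
    if isinstance(month, int) and 1 <= month <= 12:
        codes.append(f"A6_{month}")  # assumption: A6_k stands for month k
    if record.get("nature") in NATURES:
        codes.append(NATURES[record["nature"]])
    return codes


complete = encode_record(
    {"component": "Casing", "voltage": "500 kV", "month": 12, "nature": "General"}
)
incomplete = encode_record({"component": "Tap changer", "voltage": None, "nature": "Serious"})
```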

For a state quantity that deteriorates as its value grows toward the warning value (larger is worse), let the initial value be q and the warning value be c4. The value range of the state quantity index i is divided into four intervals, which are respectively represented as

Membership function of increasing and decreasing type

In the formula, A represents four state levels,

Degraded membership function

In the formula, s = 1, 2, 3 and 4 represent four state levels,
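The membership formulas themselves are not reproduced in this text, so the following is only a sketch of one common choice: a piecewise-linear distribution over four state levels for a larger-is-worse indicator. The breakpoints c1, c2 and c3 and the exact shape are assumptions (only c4, the warning value, is named in the text), and the function name is hypothetical.

```python
def memberships(x, c1, c2, c3, c4):
    """Return the membership of x in state levels 1..4 as a list of 4 values.

    Assumes c1 < c2 < c3 < c4. Level 1 is full at or below c1, level 4 is
    full at or above c4, and between two neighbouring breakpoints the
    membership shifts linearly from the lower level to the higher one,
    so the four values always sum to 1.
    """
    if x <= c1:
        return [1.0, 0.0, 0.0, 0.0]
    if x >= c4:
        return [0.0, 0.0, 0.0, 1.0]
    nodes = [c1, c2, c3, c4]
    mu = [0.0, 0.0, 0.0, 0.0]
    for i in range(3):
        lo, hi = nodes[i], nodes[i + 1]
        if lo <= x <= hi:
            t = (x - lo) / (hi - lo)  # 0 at the lower node, 1 at the upper node
            mu[i] = 1.0 - t
            mu[i + 1] = t
            break
    return mu


# Hypothetical breakpoints, for illustration only.
example = memberships(25, 10, 20, 30, 40)
```

With these breakpoints a measurement of 25 sits halfway between levels 2 and 3, so it receives a membership of 0.5 in each.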


Considering the characteristics of incomplete data of power equipment and the problem of low mining efficiency caused by the repeated scanning of the database in the mining process of the Apriori algorithm, the frequent item mining based on an array is selected to improve the mining efficiency. The specific improvement idea of the algorithm is as follows:

The transaction database is divided into different transaction units according to different manufacturers, and the data in these transaction units are stored in two-dimensional arrays.

Mine frequent items until no

The frequent itemsets obtained are merged into higher-order frequent itemsets, and their confidence is then calculated. Itemsets whose confidence is not less than the minimum confidence are kept, which yields the strong association rules. In detail, the key information and its types are first marked with the corresponding structured identifiers. The transaction units are divided by manufacturer, and the structured data in each unit is stored in a two-dimensional array. The data is scanned and compared with the data in the transaction database: data that exists in the two-dimensional array is represented by ‘1’, and data that does not is represented by ‘0’. The resulting Boolean data is easy to count, so the support degree can be calculated, the frequent itemsets obtained, and the strong association rules found. The clustering model flow of the array-based Apriori algorithm is shown in
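A minimal sketch of the array representation is given below, assuming each record is a pair of a manufacturer identifier and a list of structured items. The fault identifiers prefixed with `B_` and all function names are hypothetical. Once a transaction unit is stored as a 0/1 array, the support of an itemset is a count over rows, so the raw records do not have to be rescanned.

```python
from collections import defaultdict


def partition_by_manufacturer(records):
    """Group (manufacturer, items) records into one transaction unit per manufacturer."""
    units = defaultdict(list)
    for manufacturer, items in records:
        units[manufacturer].append(set(items))
    return dict(units)


def to_boolean_array(unit, columns):
    """Store a transaction unit as a 2-D list: 1 if the item is present, else 0."""
    return [[1 if c in transaction else 0 for c in columns] for transaction in unit]


def support(array, columns, itemset):
    """Support of an itemset, counted from the Boolean array alone."""
    idx = [columns.index(item) for item in itemset]
    hits = sum(1 for row in array if all(row[j] for j in idx))
    return hits / len(array)


# Hypothetical records, for illustration only.
records = [
    ("A1_1", ["A4_2", "A5_3", "B_crack"]),
    ("A1_1", ["A4_2", "A5_2", "B_crack"]),
    ("A1_1", ["A4_3", "A5_3", "B_leak"]),
    ("A1_2", ["A4_1", "A5_1", "B_discoloration"]),
]
columns = ["A4_1", "A4_2", "A4_3", "A5_1", "A5_2", "A5_3",
           "B_crack", "B_leak", "B_discoloration"]

units = partition_by_manufacturer(records)
array = to_boolean_array(units["A1_1"], columns)
casing_crack = support(array, columns, ["A4_2", "B_crack"])
```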

Quantitative evaluation method for non-numerical data

Because non-numerical data does not vary continuously the way numerical data does, it is difficult to use a general fuzzy-distribution membership function to quantify the membership degree of each state quantity index in each state level, and equally difficult to quantify the weight of each index and the degree of equipment deterioration with fuzzy mathematics. The guideline-based scoring method is used instead: the comprehensive score of a state quantity is the product of its basic score and its weight coefficient. The evaluation method of quantitative membership degree of non-numerical data is shown in

Parts | Normal state: total score | Attention state: total score | Attention state: single item scoring | Abnormal state: single item scoring | Serious state: single item scoring |
---|---|---|---|---|---|

Circuit breaker body | <30 | >30 | [ | [ | >30 |

Operating mechanism | <20 | >20 | [ | [ | >30 |

Shunt capacitor | <12 | >12 | [ | [ | >30 |

Closing resistor | <12 | >12 | [ | [ | >30 |
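The scoring rule stated above (comprehensive score equals basic score times weight coefficient) can be sketched as follows. The state quantity names and the numbers are hypothetical. Only the normal-state total-score threshold of the circuit breaker body (below 30) is checked, because the single-item interval bounds are not recoverable from the table.

```python
def comprehensive_scores(state_quantities):
    """Return ({name: item score}, total score).

    state_quantities: iterable of (name, basic_score, weight_coefficient).
    Each item score is the basic score multiplied by the weight coefficient.
    """
    item_scores = {name: basic * weight for name, basic, weight in state_quantities}
    return item_scores, sum(item_scores.values())


# Hypothetical state quantities of a circuit breaker body.
items, total = comprehensive_scores([
    ("contact wear", 8, 2),
    ("gas leakage", 4, 3),
])
within_normal_total = total < 30  # normal-state total score for the circuit breaker body
```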

The state quantity index evaluation matrix is obtained after the calculation of the membership function, which contains the membership values of the four state levels corresponding to the measured values of each state quantity index. This evaluation matrix A is expressed as:

The maximum membership value of each state quantity index belonging to four state levels is extracted from the evaluation matrix, and the state information phasor is constructed, namely:
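Written out, and assuming the evaluation matrix A has one row per state quantity index and one column per state level, the construction described here reads as follows. The entry symbol a_{ij} (membership of index i in level j) is notation introduced for this sketch.

```latex
e = (e_1, e_2, \ldots, e_n), \qquad e_i = \max_{1 \le j \le 4} a_{ij}, \quad i = 1, \ldots, n
```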

Calculate the objective correction coefficient of each state quantity index according to the state information phasor e:

In the formula, n is the number of state quantity indicators. Finally, the comprehensive weight

Incomplete transformer data was collected from a power company. Important information such as manufacturer, equipment type, equipment component, equipment model, voltage level, discovery time, service location and defect nature is represented by variable A, and the corresponding faults are represented by variable B; together they are the objects of data mining and analysis. The minimum support and minimum confidence are set for cluster mining with the array-based Apriori algorithm. Frequent itemsets are generated by the mining process, in which 1 itemset of frequent items is generated, and higher order frequent items are not obtained [

Number | Strong association rules | Support degree % | Confidence degree % |
---|---|---|---|

1 | A1_2 plant-Silica gel discoloration | 4.4 | 94.1 |

2 | Casing-crack | 4.1 | 97.5 |

3 | 500 kV-abnormal vibration | 1.8 | 96.6 |

4 | A4 Model 6 Transformer-Oil seepage | 8.4 | 89.4 |

5 | December, January-March-Silica gel discoloration | 4.7 | 75.3 |

6 | Site A7_2-Rust | 2.3 | 71.3 |

7 | Transformer type A4_8-Oil seepage/leakage | 6.7 | 87.3 |

The strong association rule in row 1 indicates that the respirator silica gel of transformers produced by Factory 2 (A1_2) discolors easily. In the actual mining data, the respirator silica gel of transformers from almost all manufacturers discolors; this factory is simply the most prominent case. Row 2 shows that the casing often cracks. Row 3 indicates that abnormal vibration is more likely during the operation of power transformers of the 500 kV voltage class. Rows 4 and 7 show that transformers with equipment model numbers 6 and 8 are prone to oil seepage/leakage. Similarly, the mining process found that almost every type of transformer had some degree of oil seepage/leakage [

Number | Strong association rule | Results | Reasons |
---|---|---|---|

1 | A1_2 plant-Silica gel discoloration | General defect | Low temperature |

2 | Casing-crack | General defect | Humid climate |

3 | 500 kV-abnormal vibration | General defect | Low temperature |

4 | A4 Model 6 Transformer-Oil seepage/leakage | General defect | Humid climate |

5 | December, January-March-Silica gel discoloration | General defect | Low temperature |

6 | Site A7_2-Rust | General defect | Humid climate |

7 | Transformer type A4_8-Oil seepage/leakage | General defect | Humid climate |

The Apriori rule algorithm proposed in this paper mines the spatial neighborhood information of the samples and uses the local information in their spatial distribution to perform a secondary correction and filling of the missing values. To verify the effectiveness of this method, this section presents the experimental results and their analysis. The focus is on how the CCA-IR framework improves classical filling methods. Five classical filling methods, namely mean filling, SoftI, KNN filling (KNNI), MFI and MICE, are therefore used in the pre-filling stage. These five methods are chosen because they are representative in two respects. First, mean filling, SoftI, MFI and MICE are statistical filling methods, while KNN filling is a typical machine learning filling method. Second, mean filling, SoftI, KNNI and MFI are single filling methods, while MICE is a classic multiple filling algorithm. The details of the experimental data sets are as follows.

In this paper, ten data sets are selected from UCI for experiments.

Dataset | #Sample number | #Attribute number | #Category number |
---|---|---|---|

BCC | 116 | 10 | 2 |

Bal-s | 625 | 5 | 3 |

Car | 1728 | 7 | 4 |

Cro-M | 10546 | 25 | 6 |

Dba | 1372 | 5 | 2 |

Ecoli | 336 | 9 | 8 |

Glass | 214 | 10 | 6 |

Hab | 306 | 4 | 2 |

Iris | 150 | 5 | 3 |

Seg | 210 | 20 | 7 |

In the experiment, incomplete data is generated from the complete UCI data sets by deleting values at random, with the missing rate ranging from 0 to 0.25. The incomplete data is then pre-filled with the selected filling method, and the pre-filling result is corrected with the missing-value correction method based on spatial neighborhood information proposed in this paper. Finally, the original data set and the final complete set after correction filling are compared to obtain the NRMSE values and their trends under different missing rates:

In the formula,

To avoid the influence of errors from a single run on the experimental results, random deletion is repeated 100 times at each missing rate, and the reported NRMSE is the mean over these 100 repetitions.
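The evaluation protocol can be sketched as follows. Several definitions of NRMSE are in use, and the formula of the paper is not reproduced in this text, so the normalization here (RMSE over the deleted entries divided by the range of the original values) is an assumption. Mean filling stands in for whichever filling method is being evaluated, and all function names are hypothetical.

```python
import math
import random


def delete_at_random(data, missing_rate, rng):
    """Return (incomplete copy with None for deleted cells, set of deleted cells)."""
    cells = [(i, j) for i in range(len(data)) for j in range(len(data[0]))]
    n_missing = round(missing_rate * len(cells))
    missing = set(rng.sample(cells, n_missing))
    incomplete = [
        [None if (i, j) in missing else v for j, v in enumerate(row)]
        for i, row in enumerate(data)
    ]
    return incomplete, missing


def mean_fill(incomplete):
    """Fill each missing cell with the mean of the observed values in its column."""
    means = []
    for j in range(len(incomplete[0])):
        observed = [row[j] for row in incomplete if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in incomplete]


def nrmse(original, filled, missing):
    """RMSE over the deleted cells, divided by the range of the original values.

    Assumes the original data is not constant (non-zero range).
    """
    if not missing:
        return 0.0
    squared = [(original[i][j] - filled[i][j]) ** 2 for i, j in missing]
    values = [v for row in original for v in row]
    return math.sqrt(sum(squared) / len(squared)) / (max(values) - min(values))


def mean_nrmse(data, missing_rate, repeats=100, seed=0):
    """Average NRMSE over repeated random deletions at one missing rate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        incomplete, missing = delete_at_random(data, missing_rate, rng)
        total += nrmse(data, mean_fill(incomplete), missing)
    return total / repeats


# Hypothetical toy data set, for illustration only.
toy = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
score = mean_nrmse(toy, missing_rate=0.25, repeats=100)
```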

In the experiment, the incomplete data is obtained by randomly deleting the complete data sets on the ten UCIs in

Generally speaking, for data sets with few features (dimensions) and many samples, the NRMSE is small when the missing rate is small and is often lowest at the lowest missing rate, which means the filling effect is best there; as the missing rate increases, the NRMSE gradually rises. This is because a smaller missing rate means fewer missing attribute values. Take a data set with far more samples than features, such as Car, which has 7 features (dimensions) and 1728 samples: at a missing rate of 1%, only about 121 attribute values are missing, so many complete samples remain available for filling, which is very beneficial for data recovery. As the missing rate increases, fewer and fewer complete samples remain, and the corresponding NRMSE values show an upward trend.
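The figure of 121 follows directly from the size of the Car data set, assuming the missing rate is applied to all sample-attribute cells:

```latex
0.01 \times 1728 \times 7 = 120.96 \approx 121
```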

Five classical filling methods (including single filling and multiple filling) are selected and compared on the UCI data set. The results show that the framework proposed in this paper can effectively improve the filling effect of existing filling methods on most data sets. Although the improvement effect is not good on a few data sets, from the change trend of NRMSE metrics under different missing rates obtained from the experiment, most of the filling methods combined with the framework usually show a relatively stable filling trend. The framework proposed in this paper can also improve the filling effect of some poor single filling methods to the effect achieved by multiple filling methods.

This paper analyzes the multi-source heterogeneous data of the power system and explains how to handle missing and erroneous data in the incomplete data of power equipment. The unstructured information in the incomplete data is then structured. Based on a comprehensive analysis of the incomplete data of power equipment and of the characteristics of the Apriori algorithm, the array-based Apriori algorithm is used to find the key information in the incomplete data of power equipment and the strong association rules between clustering rules. The experimental results show that the proposed method achieves the best results on most data sets and can raise the classification accuracy of some weaker single filling methods to a level similar to that of better multiple filling methods.

In this paper, we propose a framework that can be applied widely to existing filling methods in order to improve their filling effect. It addresses the fact that most existing filling methods ignore the impact of the spatial distribution of samples on data recovery.

Pre-fill each sample with an existing filling method to obtain an initial filling result that will then be corrected.

Find several spatial neighborhoods that are highly similar to the sample to be filled by introducing a spatial neighborhood information mining method.

Finally, correct the filling result generated by the existing filling method by using the effective information in the space neighborhood of the sample to be filled.
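The three steps can be sketched as follows. This is an illustration of the general idea only, not the CCA-IR framework itself: mean filling as the pre-filling method, Euclidean distance as the similarity measure, the neighborhood size k and simple averaging over the neighbors are all assumptions, and the function names are hypothetical. The sketch also assumes at least two samples.

```python
import math


def mean_prefill(data):
    """Step 1: pre-fill missing cells (None) with the column mean of observed values."""
    means = []
    for j in range(len(data[0])):
        observed = [row[j] for row in data if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in data]


def correct_with_neighbors(data, k=3):
    """Steps 2 and 3: find the k nearest samples and correct the pre-filled cells."""
    prefilled = mean_prefill(data)
    corrected = [row[:] for row in prefilled]
    for i, row in enumerate(data):
        missing_cols = [j for j, v in enumerate(row) if v is None]
        if not missing_cols:
            continue
        # Step 2: neighbors are the samples closest to the pre-filled sample.
        distances = sorted(
            (math.dist(prefilled[i], prefilled[m]), m)
            for m in range(len(data)) if m != i
        )
        neighbors = [m for _, m in distances[:k]]
        # Step 3: replace each pre-filled value by the neighbors' average.
        for j in missing_cols:
            corrected[i][j] = sum(prefilled[m][j] for m in neighbors) / len(neighbors)
    return corrected


# Hypothetical toy data with one missing value, for illustration only.
toy = [[1.0, 1.0], [1.1, None], [1.0, 1.2], [9.0, 9.0], [9.1, 9.2]]
prefilled = mean_prefill(toy)
result = correct_with_neighbors(toy, k=2)
```

In this toy case the mean pre-fill places the missing value at about 5.1, halfway between the two clusters, while the neighbor-based correction moves it to about 1.1, in line with the two samples nearest to the incomplete one.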

In future work, building on the incomplete-data clustering and condition assessment in this paper, we can further study the fuzzy analysis of power equipment data across the whole system, schedule timely maintenance according to the impact of equipment condition on the system, and provide visual guidance for staff.