When learning the structure of a Bayesian network, the search space expands rapidly as the network size and the number of nodes increase, leading to a noticeable decrease in algorithm efficiency. Traditional constraint-based methods typically rely on the results of conditional independence tests. However, excessive reliance on these test results can lead to a series of problems, including increased computational complexity and inaccurate results, especially in large-scale networks, where the performance bottlenecks are particularly evident. To overcome these challenges, we propose a Markov blanket discovery algorithm based on constrained local neighborhoods for constructing undirected independence graphs. The method uses the Markov blanket discovery algorithm to refine the constraints on the initial search space and sets an appropriate constraint radius, thereby reducing the initial computational cost of the algorithm and effectively narrowing the initial solution range. Specifically, the method first determines the local neighborhood space to limit the search range, thereby reducing the number of possible graph structures that need to be considered. This process not only improves the accuracy of the search-space constraints but also significantly reduces the number of conditional independence tests. By performing conditional independence tests within the local neighborhood of each node, the method avoids comprehensive tests across the entire network, greatly reducing computational complexity. At the same time, the constraint radius further improves computational efficiency while ensuring accuracy. Compared with other algorithms, this method can quickly and efficiently construct undirected independence graphs while maintaining high accuracy.
Experimental simulation results show that this method has significant advantages in obtaining the structure of undirected independence graphs, not only maintaining an accuracy of over 96% but also reducing the number of conditional independence tests by at least 50%. This performance improvement is due to the effective constraint on the search space and the fine-grained control of computational costs.

Bayesian network (BN) is a network model that expresses the relationship between random variables and joint probability distributions, which can express and reason about uncertain knowledge [

In the construction process of a BN, the BN structure must first be determined from the given data, and then the network parameters can be learned [

As shown in

It can be seen from the literature that the number of 0-order CI tests is

Without prior knowledge, constructing undirected independence graphs with existing methods is a huge challenge in terms of both time and space complexity. Spirtes first proposed the classic Spirtes, Glymour and Scheines (SGS) algorithm [

On this basis, some scholars have proposed to improve the sorting method [

Later, Cheng et al. [

Wang et al. [

It can be seen from the above references that constraint-based algorithms need to obtain the constraint space quickly and accurately in the early stage of the algorithm. Most algorithms constrain using global node information, ignoring the unique local neighborhood relationships between nodes. At the same time, excessive dependency testing consumes more computing resources, resulting in low algorithm efficiency. In particular, a large number of high-order CI tests significantly increases the complexity of the algorithm. Based on this, we propose a Markov blanket discovery algorithm with constrained local neighborhoods. The algorithm first finds the local neighborhood space of each node accurately by setting a constraint radius and completes the initialization of the constraints. After that, the Markov blanket constraint space is established through low-order CI tests, and the undirected independence graph for BN structure learning is then constructed. Specifically, the main contributions of this article are as follows:

Firstly, the initial search space is quickly determined by leveraging the dependencies between nodes. Subsequently, the local neighborhood of nodes is constrained by an inter-node constraint radius

Secondly, to decrease the complexity of the CI tests, the Markov blanket discovery algorithm is employed to further refine the set of nodes within the constrained local neighborhood, thereby continuing to reduce the search space.

Finally, low-order CI tests are used to update the Markov blanket set, ensuring the inclusion of correct connected edges in the set and generating an undirected independent graph that accurately represents these connections.

The method proposed in this paper not only uses constraint knowledge to compress the search space, but also limits the structure search space quickly and accurately, while reducing both the order and the number of CI tests. The advantages of the algorithm are verified through comparative experiments with other algorithms.

The rest of this paper is organized as follows:

The BN consists of a two-element array, namely

In the complete set

Then call

In the complete set

Then it is said that the minimum variable set

Mutual information

Suppose

Therefore, mutual information is used to judge the connectivity between every two nodes in the network. Since mutual information has symmetry, that is
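As an illustration of this step, the following minimal Python sketch computes the empirical mutual information between two equal-length samples of discrete variables. The function name and the use of natural logarithms are assumptions for illustration, not the paper's implementation.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) (in nats) between two
    equal-length samples of discrete variables."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    # I(X;Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())
```

Because every joint term is symmetric in `x` and `y`, `mutual_information(xs, ys)` equals `mutual_information(ys, xs)` exactly, which is the symmetry property referred to above.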

Assuming that

It is said that variables

Given the variable

It is said that under the condition of a given

When testing the CI of nodes

In the process of constructing BN undirected independence graphs, the local neighborhood topology information of nodes is often ignored, which leads to excessive use of CI tests to constrain nodes. Moreover, excessive randomness in node selection greatly increases the computational cost. Therefore, in order to improve computational efficiency and accuracy, we propose a Markov blanket discovery algorithm for constrained local neighborhoods. Instead of applying CI tests blindly, the algorithm uses the local neighborhood information between nodes as a constraint, employs the Markov blanket discovery algorithm to further reduce the search space, and finally completes the construction of the undirected independence graph with a small number of low-order CI tests. The specific algorithm is implemented as follows:

In the process of constructing independence graphs using the Markov blanket algorithm, since the number of Markov blanket elements grows exponentially with the number of nodes, it is necessary to constrain the initial structure of the network. The algorithm initialization starts from an undirected empty graph and first calculates the mutual information values of all node pairs. The mutual information values are then used to judge the relationship between nodes, thereby introducing the local constraint factor

It can be seen from the above equation that the restriction of the initial model can be completed by setting the restriction factor
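To make the initialization concrete, here is a hedged Python sketch: starting from an empty undirected graph, an edge is added whenever the pairwise mutual information exceeds a threshold playing the role of the local constraint factor. The names `initialize_graph` and `epsilon` are illustrative assumptions rather than the paper's exact formulation.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # Empirical mutual information (nats) of two discrete samples.
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def initialize_graph(data, epsilon):
    """Start from an empty graph on the variables in `data`
    (a dict: node name -> list of samples) and connect every pair
    whose mutual information exceeds the constraint factor `epsilon`."""
    nodes = list(data)
    edges = set()
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if mutual_information(data[u], data[v]) > epsilon:
                edges.add(frozenset((u, v)))
    return nodes, edges
```

Raising `epsilon` yields a sparser initial graph with fewer candidate edges to test later; lowering it keeps more candidate edges at the cost of more CI tests.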

The network structure of a BN is generally a connected graph. After the above construction process, the intermediate undirected graph may be disconnected, so its connectivity needs to be repaired. It is known from graph theory that if an undirected graph is disconnected, it can be represented by several connected components; only by ensuring that these connected components are connected to each other can the disconnected graph be restored to a connected graph. Therefore, the connectivity of the current network needs to be repaired after the mutual information values are judged.
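The repair step can be sketched as follows: compute the connected components of the current undirected graph and, if there are several, bridge them. Which bridging edge to choose is left open in the text; linking arbitrary representatives below is only an illustrative assumption (the pair with the highest mutual information would be a natural refinement).

```python
def connected_components(nodes, edges):
    """Connected components of an undirected graph given as a node list
    and a set of frozenset edges."""
    adj = {u: set() for u in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def repair_connectivity(nodes, edges):
    """Join all components to the first one so the graph is connected."""
    comps = connected_components(nodes, edges)
    anchor = next(iter(comps[0]))
    for comp in comps[1:]:
        edges.add(frozenset((anchor, next(iter(comp)))))
    return edges
```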

It can be seen that the current undirected graph is established only under the mutual information condition. To further increase the accuracy of the judgment, the Markov blanket needs a second correction by other means.

After the network initialization and construction in the previous stage, the mutual information constraint adds many edges to the empty graph, and because the constraint factor

As a result, in the second phase of the algorithm, the false positive connected edges are eliminated and potential missing edges are found through the local characteristic information between nodes and the CI test, finally establishing a Markov blanket with higher accuracy. Therefore, in this phase, we conduct two CI tests.
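A minimal sketch of a low-order CI test via conditional mutual information: X and Y are judged conditionally independent given Z when I(X;Y|Z) falls below a small threshold. Real implementations typically use a G² or chi-square test with a significance level; the threshold-based decision and the function names here are simplifying assumptions.

```python
import math
from collections import Counter

def conditional_mi(xs, ys, zs):
    """Empirical conditional mutual information I(X;Y|Z) in nats:
    sum_{x,y,z} p(x,y,z) * log( p(x,y,z) p(z) / (p(x,z) p(y,z)) )."""
    n = len(xs)
    cz = Counter(zs)
    cxz = Counter(zip(xs, zs))
    cyz = Counter(zip(ys, zs))
    cxyz = Counter(zip(xs, ys, zs))
    return sum((c / n) * math.log(c * cz[z] / (cxz[(x, z)] * cyz[(y, z)]))
               for (x, y, z), c in cxyz.items())

def ci_test(xs, ys, zs, threshold=1e-3):
    """Declare X and Y conditionally independent given Z when the
    statistic is below `threshold` (an illustrative cutoff)."""
    return conditional_mi(xs, ys, zs) < threshold
```

With a single conditioning variable this is a first-order test; the empty conditioning set reduces it to plain mutual information, i.e., a 0-order test.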

As mentioned in the previous chapter, when there is a connection between two nodes, the CI test accuracy for adjacent nodes is relatively high and the computational cost is low. In order to reduce the number of CI test invocations and the computational cost, neighboring nodes should be used first to perform the CI test. For this reason, a constraint radius

Using the calculated constraint radius
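The constraint-radius idea can be sketched as follows: for a CI test on a pair (u, v), the candidate conditioning nodes are restricted to nodes within graph distance r of either endpoint in the current undirected graph, so only local, low-order tests are performed. The helper names and the BFS formulation are illustrative assumptions.

```python
from collections import deque

def nodes_within_radius(adj, start, r):
    """BFS: all nodes at graph distance 1..r from `start` in the
    adjacency map `adj` (node -> set of neighbors)."""
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        if dist[u] == r:
            continue  # do not expand past the constraint radius
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return set(dist) - {start}

def candidate_separators(adj, u, v, r):
    """Conditioning candidates for a CI test on (u, v): nodes within
    radius r of either endpoint, excluding the endpoints themselves."""
    return (nodes_within_radius(adj, u, r)
            | nodes_within_radius(adj, v, r)) - {u, v}
```

Smaller values of r keep the conditioning sets small (and the CI tests low-order); larger values recover more of the global search behavior at higher cost.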

The second step of the algorithm may falsely delete true positive connected edges, leave some missing nodes out of the Markov blanket set, or produce an incompletely connected graph. The last part of the algorithm therefore fixes these problems. First, the connectivity is reconfirmed and repaired; second, reliable CI tests are again used to complete this part. On the basis of the undirected graph obtained in the second step, CI tests are used to compute the independence relationships between nodes and correct the nodes in the Markov blanket set. In particular, this part of the adjustment no longer deletes edges. Finally, we obtain a complete connected undirected independence graph

In this section, we will analyze the time complexity of the algorithm, and use the worst-case time complexity as the basis for judgment. Next, we will discuss the time complexity of each step separately. Assuming there are

In Algorithm 1, we first initialize the network structure, which has a time complexity of

In Algorithm 2, the initial calculation of the constraint radius

In Algorithm 3, there are steps similar to those in Algorithms 1 and 2. The most time-consuming step is the CI tests, which have a time complexity of

In order to verify the performance of the algorithm, the experiment is divided into two parts. The first part determines the values of the algorithm parameters, comparing various indicators on different datasets to confirm that the chosen parameters generalize. The second part uses the resulting parameter values in the subsequent process and compares the algorithm with the other three similar Markov blanket algorithms to verify its effectiveness. The experimental platform used in this paper is a personal computer with an Intel Core i7-6500U CPU at 2.50 GHz, 64-bit architecture, and 8 GB RAM, running Windows 10. All programs were implemented in MATLAB release R2014a.

In order to verify the two parameters

Parameter experiment setting range,

Standard \ Estimation | True | False | Total
---|---|---|---
True | True Positive (TP) | False Negative (FN) | True (T)
False | False Positive (FP) | True Negative (TN) | False (F)
Total | Positive (P) | Negative (N) | ALL

Note: True Positive: the prediction is true and the actual value is true (correct prediction). True Negative: the prediction is false and the actual value is false (correct prediction). False Positive: the prediction is true but the actual value is false (incorrect prediction). False Negative: the prediction is false but the actual value is true (incorrect prediction).

This article uses the following four indicators to determine the performance of the experiment [

Accuracy:

Euclid Distance:

True positive rate: Sometimes called sensitivity

False positive rate: equal to one minus the specificity
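The four indicators can be computed directly from the confusion-matrix counts. In the sketch below, the Euclidean distance is taken as the distance of the (FPR, TPR) point from the ideal corner (0, 1) of ROC space, which is one common reading; the paper's exact definition may differ, so treat this as an assumption.

```python
import math

def metrics(tp, fn, fp, tn):
    """Accuracy, TPR, FPR, and Euclidean distance from TP/FN/FP/TN counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)   # accuracy
    tpr = tp / (tp + fn)                    # true positive rate (sensitivity)
    fpr = fp / (fp + tn)                    # false positive rate (1 - specificity)
    dist = math.hypot(fpr, 1 - tpr)         # distance to the ideal point (0, 1)
    return acc, tpr, fpr, dist
```

Under this reading, a perfect result (TPR = 1, FPR = 0) gives a distance of zero, so smaller distances indicate better structure recovery.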

The experiments use sample datasets of four standard networks, namely the ALARM, CHILD3, CHILD5 and CHILD10 networks, with sample sizes of 500, 1000 or 5000. The specific information of the data is shown in

Datasets | Nodes | Edges | Data size
---|---|---|---
ALARM | 37 | 46 | 500, 1000, 5000
CHILD3 | 60 | 79 | 500, 1000, 5000
CHILD5 | 100 | 126 | 5000
CHILD10 | 200 | 257 | 5000

The horizontal axis of the experimental results represents the results corresponding to different parameter values of

For the metrics of true positive rate and false positive rate, they respectively reflect the proportions of correct edges and incorrect edges among total edges. Across different datasets of the ALARM network, both metrics show an increasing trend in error rates as

In order to find a reasonable parameter setting for generalization, the algorithm’s testing dataset is further expanded. Subsequent observations will focus on the situations in other networks.

In order to verify the effectiveness of the algorithm, the algorithm in this paper is compared with three other algorithms, PC [

Dataset | | CLN-MB | PC-MB | REC-MB | EEMB | DCMB
---|---|---|---|---|---|---
ALARM-500 | ACC | 0.9659 | 0.9403 | 0.9644 | 0.9573 |
| SNCC | 2439 | 3813 | 3639 | 2048 |
| SCO | 2152 | 3869 | 2980 | 1721 |
| TIME | 2.5694 | 2.3879 | 3.5030 | 2.8431 |
ALARM-1000 | ACC | 0.9687 | 0.9374 | 0.9603 | 0.9644 |
| SNCC | 3533 | 4052 | 3968 | 1921 |
| SCO | 3207 | 4491 | 3579 | 3309 |
| TIME | 4.3859 | 5.2727 | 5.2430 | 6.9575 |
ALARM-5000 | ACC | 0.9488 | 0.9758 | 0.9701 | 0.9733 |
| SNCC | 5610 | 4796 | 4929 | 3098 |
| SCO | 7362 | 6970 | 5988 | 5621 |
| TIME | 28.9073 | 26.2061 | 23.6907 | 22.1215 |

Dataset | | CLN-MB | PC-MB | REC-MB | EEMB | DCMB
---|---|---|---|---|---|---
CHILD3-500 | ACC | 0.9612 | 0.9721 | 0.9765 | 0.9730 |
| SNCC | 4251 | 9725 | 8201 | 2874 |
| SCO | 3161 | 8396 | 5385 | 3011 |
| TIME | 4.0198 | 4.9793 | 5.8778 | 3.8801 |
CHILD3-1000 | ACC | 0.9645 | 0.9758 | 0.9781 | 0.9762 |
| SNCC | 6463 | 10,648 | 8867 | 2939 |
| SCO | 6871 | 12,230 | 6564 | 3947 |
| TIME | 3.8979 | 5.9702 | 9.0821 | 7.8169 |
CHILD3-5000 | ACC | 0.9716 | 0.9802 | 0.9831 | 0.9825 |
| SNCC | 13,799 | 12,524 | 10,638 | 3506 |
| SCO | 25,525 | 18,152 | 9984 | 4039 |
| TIME | 41.6837 | 80.3910 | 27.4013 | 14.2084 |
CHILD5-5000 | ACC | 0.984 | 0.9862 | 0.9903 | 0.9819 |
| SNCC | 25,795 | 33,719 | 25,612 | 5092 |
| SCO | 46,233 | 46,888 | 20,400 | 6994 |
| TIME | 73.8609 | 126.222 | 63.8608 | 47.3822 |
CHILD10-5000 | ACC | 0.9917 | 0.9967 | 0.9951 | 0.9970 |
| SNCC | 87,071 | 141,379 | 62,991 | 9254 |
| SCO | 60,975 | 201,652 | 92,497 | 13,958 |
| TIME | 168.9319 | 559.1701 | 240.1646 | 145.2847 |

As can be seen from

For the CHILD3, CHILD5 and CHILD10 datasets in

In contrast, the PC-MB algorithm uses randomly selected nodes for CI testing during initialization, whereas the algorithm in this paper can accurately and quickly find and connect highly correlated nodes under the guidance of the Markov blanket. This also prevents the algorithm from using high-order CI tests, further saving computational cost. Compared with the EEMB algorithm, the advantage of this paper is that EEMB can only update the Markov blanket once, which makes it easy to lose key nodes, so its accuracy is lower. The algorithm in this paper makes the Markov blanket set more complete through initialization and a second update, and the low-order CI tests also ensure the efficiency of the algorithm. The DCMB algorithm has a certain advantage in computational efficiency, but as the dataset size increases, our algorithm demonstrates a greater advantage in the number of CI test calls. From the experimental data, it is evident that, compared with other algorithms, the algorithm presented in this chapter achieves higher accuracy with fewer and lower-order conditional independence tests. This capability primarily stems from the adjustment of the distance parameter to reduce computational complexity after the initialization phase of the algorithm. In the case of the ALARM network, whose size provides relatively less local information than the other datasets in this paper, the advantage of this approach is less pronounced in the computational results. However, the algorithm's ability to achieve high accuracy with fewer and lower-order tests demonstrates its efficiency and effectiveness across different datasets and network complexities.

We proposed a Markov blanket discovery algorithm based on the local neighborhood space, which restricts the spatial range of the initial set by constraining the local neighborhood space of nodes. At the same time, the Markov blanket discovery algorithm is used to complete the constraint on the search space, and together the two effectively reduce the frequency of CI tests. The establishment of the local constraint factor greatly reduces the use of high-order CI tests. Through experimental simulation, the values of the two parameters proposed by the algorithm are first determined; then, on the same network models and different datasets, the algorithm is compared with other algorithms. The algorithm in this paper achieves a higher accuracy rate while using fewer and lower-order CI tests. In future work, achieving high accuracy on small datasets is a possible research direction.

The authors wish to acknowledge Jingguo Dai and Yani Cui for their help in interpreting the significance of the methodology of this study.

This work is supported by the National Natural Science Foundation of China (62262016, 61961160706, 62231010), 14th Five-Year Plan Civil Aerospace Technology Preliminary Research Project (D040405), the National Key Laboratory Foundation 2022-JCJQ-LB-006 (Grant No. 6142411212201).

The authors confirm contribution to the paper as follows: study conception and design: Kun Liu, Peiran Li; methodology: Yu Zhang, Xianyu Wang; validation: Kun Liu, Ming Li, and Cong Li; formal analysis: Kun Liu, Peiran Li, Yu Zhang, Jia Ren; investigation: Ming Li, Cong Li; resources: Xianyu Wang; data curation: Kun Liu, Peiran Li and Ming Li; draft manuscript preparation: Kun Liu, Peiran Li; writing—review and editing: Kun Liu, Yu Zhang; supervision: Yu Zhang, Jia Ren; project administration: Jia Ren, Xianyu Wang, and Yu Zhang; funding acquisition: Jia Ren and Cong Li. All authors reviewed the results and approved the final version of the manuscript.

Data available on request from the authors. The data that support the findings of this study are available from the corresponding author, Yu Zhang, upon reasonable request.

Not applicable.

The authors declare that they have no conflicts of interest to report regarding the present study.