Network data contains a large amount of exploitable information, but classical community detection algorithms struggle with networks whose topology is sparse. Representation learning on network data is therefore usually paired with a clustering algorithm to solve the community detection problem. However, the class clusters output by graph representation learning follow a distribution that cannot be predicted in advance. We therefore propose an improved density peak clustering algorithm (ILDPC) for community detection, which improves the local density mechanism of the original algorithm and better accommodates class clusters of different shapes. The algorithm is paired with the benchmark model Graph sample and aggregate (GraphSAGE) to show the adaptability of ILDPC to community detection. The plotted decision diagrams show that ILDPC is more discriminative than the original algorithm in selecting density peak points. Finally, ILDPC is compared with K-means and other clustering algorithms on this benchmark model, and the F1-score results show that it is better suited to community detection in sparse networks. The sensitivity of the ILDPC parameters to the low-dimensional vector set output by GraphSAGE is also analyzed.

Network data has become one of the most common data structures in daily life, and community detection is an important part of data mining on networks. In network data formed by nodes, the nodes inside a community are more closely connected to each other than to nodes outside the community. A community structure is thus a collection of interconnected graph nodes that corresponds to a meaningful group in the real world. The main purpose of community detection is to uncover the community information hidden in the complex network data structure. With the development of big data and the Internet, the variety and quantity of network data are growing rapidly, which poses a challenge to community detection in network data structures. Community detection based on graph structure is of great relevance, and existing network data is used in recommender systems [

Networks are usually represented as adjacency matrices, which can carry a large amount of edge connectivity information. However, with the explosive growth of data, both the size of graph data and the number of nodes per graph keep increasing, pushing classical community detection algorithms beyond their capacity. Moreover, the adjacency matrices formed from existing network data have sparse edges and a large number of nodes. Many existing classical community detection methods have also been applied on a large scale [

The DeepWalk model proposed by Perozzi et al. [

Therefore, this paper uses GraphSAGE unsupervised representation learning for the specific task of community detection, combined with an improved density peak clustering algorithm. The model makes full use of the adjacency information of graph nodes and fuses the first-order and second-order similarities of nodes to obtain a vector representation of the nodes in a low-dimensional space. The adjacency matrix of the graph is first mapped to low-dimensional vectors by GraphSAGE, and community detection is then performed with the improved density peak algorithm to obtain the appropriate class clusters. The proposed algorithm can solve the community detection problem on attributed graph data. To verify how well the improved density peak algorithm adapts to the community detection task, experiments are conducted, and they confirm the effectiveness of the proposed model on the evaluation metric F1-score.

Graph representation learning offers new ways to learn feature representations of graphs, and GraphSAGE graph representation learning demonstrates impressive performance. For a graph defined as _{i} _{1}, _{2}, _{3}_{n}

Given a dataset

The framework proposed in this paper consists of two main phases. The first phase is training: each node is mapped to a highly nonlinear space by aggregating the information of its neighboring nodes, and the result is passed to the next convolution layer. Positive and negative nodes are drawn by positive and negative node sampling, and the model parameters are then updated by backpropagating the loss, so that the model learns to map the graph data to a low-dimensional vector space in which the vectors retain aggregation and community characteristics. The second phase is clustering: the graph embedding vectors output by the first phase are clustered to complete the community detection task on the network data. For this phase, an improved local density peak clustering algorithm is proposed to solve the downstream assignment problem left by the first phase; it can effectively fuse the global information of the graph and complete the community detection.

The low-dimensional vector embedding of graphs can be framed as an unsupervised learning problem: the nodes in the graph data are first sampled, and the model then outputs a vector representation for each node.

In the data preparation phase, the attributes of the graph nodes are first normalized, which helps the optimizer find a good solution, shortens the training time, and speeds up the convergence of the model. In forward propagation, each node passes information to the next layer by aggregating the representations of its neighboring nodes. For computational efficiency, GraphSAGE randomly samples the neighbors of the current node and propagates by aggregating the information of the sampled neighbors. The propagation method used in this paper is expressed by

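where $h_v^{k-1}$ is the representation of node $v$ at layer $k-1$, $h_u^{k-1}$ is the representation of a neighboring node $u$ at layer $k-1$, and the aggregated result is mapped to $h_v^{k}$ by the activation function.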
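For reference, the layer-wise update in the original GraphSAGE formulation can be written as below. The symbols $h$, $W^{k}$, $\sigma$, and $\mathcal{N}(v)$ follow that formulation and are an assumption about the notation used here; the exact aggregator in this paper may differ.

```latex
h_{\mathcal{N}(v)}^{k} = \mathrm{AGGREGATE}_{k}\left( \left\{ h_u^{k-1} : u \in \mathcal{N}(v) \right\} \right),
\qquad
h_v^{k} = \sigma\left( W^{k} \cdot \mathrm{CONCAT}\left( h_v^{k-1},\; h_{\mathcal{N}(v)}^{k} \right) \right)
```

Here $\mathcal{N}(v)$ denotes the sampled neighborhood of node $v$, $W^{k}$ the trainable weight matrix of layer $k$, and $\sigma$ the activation function.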

In this paper, the output values are mapped to the nonlinear space using a rectified linear unit (ReLU)
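The propagation step described above can be sketched in plain Python. This is a minimal illustration, assuming a mean aggregator, concatenation of the node's own vector with the neighbor mean, and a ReLU activation; the function name `sage_layer`, its parameters, and the toy data are hypothetical and not taken from the paper.

```python
import random


def relu(x):
    # Rectified linear unit: max(0, x).
    return max(0.0, x)


def sage_layer(features, neighbors, weights, num_samples, rng):
    """One GraphSAGE-style layer with a mean aggregator (illustrative sketch).

    features:    dict node -> list of floats (length d)
    neighbors:   dict node -> list of neighboring nodes
    weights:     list of rows, each of length 2 * d, applied to [self || mean]
    num_samples: maximum number of neighbors sampled per node
    rng:         random.Random instance used for neighbor sampling
    """
    out = {}
    for v, h_v in features.items():
        nbrs = neighbors.get(v, [])
        if len(nbrs) > num_samples:
            # Random neighbor sampling keeps the cost per node bounded.
            nbrs = rng.sample(nbrs, num_samples)
        if nbrs:
            agg = [sum(features[u][i] for u in nbrs) / len(nbrs)
                   for i in range(len(h_v))]
        else:
            agg = [0.0] * len(h_v)
        concat = h_v + agg  # concatenate own vector with the neighbor mean
        out[v] = [relu(sum(w * x for w, x in zip(row, concat)))
                  for row in weights]
    return out


# Toy graph: node 0 is connected to nodes 1 and 2.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
weights = [[1.0, 0.0, 1.0, 0.0], [0.0, -1.0, 0.0, -1.0]]
embeddings = sage_layer(features, neighbors, weights, 2, random.Random(0))
```

The second weight row produces negative pre-activations on this toy input, so ReLU clips that output dimension to zero.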

To learn the low-dimensional vector embedding of the graph, positive and negative nodes are first sampled from the input network data and used to train the loss function. The cross-entropy loss function is

where _{i} denotes _{i} is the forward propagation output value of data

The loss function for the graph network is inspired by the cross-entropy function. The loss used for backpropagation is designed so that nodes that are adjacent or close to each other retain similar vector features, while nodes that are far apart are differentiated as much as possible. The sampled node itself and its neighboring nodes are used as positive samples, while the negative samples are collected by random sampling from random walks on the graph. The positive nodes, negative nodes, and sampled nodes are then propagated forward to obtain a low-dimensional feature representation of the graph.

The loss function for representation learning using the GraphSAGE framework in this paper is expressed by

where $z_{v}$ denotes the vector representation output for the sampled node after forward propagation, $z_{pos}$ denotes the vector representation output for the positive sample after forward propagation, and $z_{neg}$ is the vector representation output for the negative sample after forward propagation.
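As an illustration, the unsupervised loss from the original GraphSAGE formulation can be computed for a single sampled node as follows. This is a sketch under the assumption that the loss here takes that standard form, with one positive sample and the expectation over negatives approximated by a sum over the sampled negatives; the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unsupervised_loss(z_v, z_pos, z_negs):
    """GraphSAGE-style loss for one sampled node (assumed standard form).

    z_v:    embedding of the sampled node
    z_pos:  embedding of one positive sample
    z_negs: list of embeddings of the negative samples
    """
    # Pull the sampled node towards its positive sample ...
    pos_term = -log_sigmoid(dot(z_v, z_pos))
    # ... and push it away from every negative sample.
    neg_term = -sum(log_sigmoid(-dot(z_v, z_n)) for z_n in z_negs)
    return pos_term + neg_term


# The loss is low when the positive is aligned and the negative is opposed.
good = unsupervised_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = unsupervised_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
```

Minimizing this quantity keeps nearby nodes similar in the embedding space and separates distant ones, which is the behavior the text asks of the loss.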

In this paper, the W-weight parameters in the model are updated using the adaptive moment estimation (Adam) optimizer. A low-dimensional representation of each node is constructed, and the backpropagation of the model is achieved using the loss function proposed in

The density peaks algorithm clustering by fast search and find of density peaks(DPC) [_{i} _{1}, _{2}, _{3} _{n}_{i,} and finally completing the decision map. The local density parameter of each data point is defined as

where $d_{ij}$ is the Euclidean distance between data point $x_{i}$ and data point $x_{j}$, and

The centroids of the class clusters are selected by calculating the local density $\rho_{i}$ of each data point and the minimum distance value $\delta_{i}$ and large minimum distance values
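The two quantities used by the original DPC algorithm can be sketched as follows. The sketch assumes the cutoff-kernel density (the number of other points closer than the truncation distance) and the usual convention that the densest point receives its largest distance to any other point; the function name `dpc_quantities` and the toy data are hypothetical.

```python
import math


def dpc_quantities(points, d_c):
    """Return (rho, delta) for each point, as in the original DPC algorithm.

    rho[i]:   number of other points closer than the truncation distance d_c
    delta[i]: distance to the nearest point with strictly higher density
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser point gets its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Two well-separated groups on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
rho, delta = dpc_quantities(points, d_c=1.5)
# Candidate centers are the points where both rho and delta are large.
centers = [i for i in range(len(points)) if rho[i] >= 2 and delta[i] > 5.0]
```

Plotting `rho` against `delta` gives the decision diagram, in which the middle point of each group stands out in the upper right corner.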

The ILDPC proposed in this paper is an improvement of the local density in the DPC algorithm so that it can be adapted to large-scale low-dimensional dense data sets. As shown in

To solve the above problem, the algorithm is designed to separate the boundaries of class clusters in the dataset more clearly. For each low-dimensional vector $x_{i} \in \{x_{1}, x_{2}, x_{3}, \ldots, x_{n}\}$, the sum of distances $d_{i}$ from the data points within the truncation distance of data point $x_{i}$ to data point $x_{i}$ in this algorithm is shown in

Thus the improved local density $\rho_{i}$ corresponding to each data point is defined as shown in

The improved local density in ILDPC first counts the data points within the truncation distance of data point $x_{i}$, and then calculates the sum of the distances from those points to $x_{i}$. Because the data points in a class cluster are closely connected to each other, the points around the center of the class cluster lie closer together and their distance sum $d_{i}$ is smaller, whereas incorrectly selected centroids have a larger distance sum $d_{i}$. The improved local density therefore relies more on the local structure of the data: correct centroids are locally denser than incorrect ones and can be selected accordingly. The improved local density sharpens the distinction between sample class clusters, which in turn makes the algorithm better suited to large-scale dense data.
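The two ingredients of the improved local density, the neighbor count and the distance sum within the truncation distance, can be sketched as follows. How ILDPC combines them into the final density is given by the paper's own equation and is deliberately not guessed here; the function name `neighborhood_stats` and the toy data are hypothetical.

```python
import math


def neighborhood_stats(points, d_c):
    """For each point, count the neighbors within d_c and sum their distances.

    Returns (counts, dist_sums). These are only the ingredients described in
    the text, not the final ILDPC density.
    """
    counts, dist_sums = [], []
    for i, p in enumerate(points):
        near = []
        for j, q in enumerate(points):
            if j == i:
                continue
            d = math.dist(p, q)
            if d < d_c:
                near.append(d)
        counts.append(len(near))
        dist_sums.append(sum(near))
    return counts, dist_sums


# A tight group around 0.1 and a loose group around 5.9.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.9,), (6.8,)]
counts, dist_sums = neighborhood_stats(points, d_c=1.0)
# Points 1 and 4 have the same neighbor count, but the distance sum of
# point 1 is much smaller because its neighbors are packed more tightly.
```

In this toy example the plain neighbor count cannot tell the two middle points apart, while the distance sum can, which is the extra discrimination the improved density relies on.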

After the low-dimensional vector embedding of the nodes has been obtained with the GraphSAGE framework, the nodes must be grouped by applying a suitable clustering algorithm to the embedded vectors. The clustering result reflects the performance of the whole model and indicates the quality of the whole community detection task. In this paper, we use the F1-score, which is the harmonic mean of precision and recall and a common metric for evaluating clustering, to measure the community detection results on labeled network data. Precision is defined as.

The recall is defined as.

The specific F1-score formula is defined as.

Precision is the proportion of truly positive samples among all samples predicted as positive. Recall is the proportion of truly positive samples that are predicted as positive.
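For reference, the standard definitions can be written in terms of true positives ($TP$), false positives ($FP$), and false negatives ($FN$). These symbols are the conventional ones and are assumed here rather than taken from the paper.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_{1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```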

This paper explores community detection methods based on unsupervised clustering of complex graph data structures. To verify the effectiveness of the proposed method, experiments are conducted on the datasets Cora, Computer [

Dataset | Number of node categories | Number of nodes | Number of edges |
---|---|---|---|
Cora | 7 | 2708 | 5278 |
Computer | 10 | 13752 | 491722 |
PubMed | 3 | 19717 | 88648 |

To verify the effectiveness of community detection, GraphSAGE is combined in the experiments with three downstream clustering algorithms: K-means, DPC, and the proposed ILDPC. DPC and ILDPC are both density-based clustering algorithms. To ensure a fair comparison on large sparse graph datasets, the parameters of each algorithm are selected over several runs and the best result of each algorithm is reported. In the graph representation learning phase, the attribute values of the dataset are first normalized, and then the training of the model begins. The model parameters are set as follows:

In this paper, we use the micro-averaged F1-score (Micro-F1) to measure the experimental results; the larger this metric, the better the result. A high score also indicates that the network topology information was well preserved in the upstream task of the experiment. The experiments were carried out on the attributed datasets Cora, Computer, and PubMed. Training the model parameters yields a low-dimensional embedding of the nodes in which nodes that are far from each other are separated as much as possible from nodes that are close to each other. The embedded node vectors are then grouped to complete the community detection task. Different clustering algorithms perform differently on the community detection problem, so several of them are compared to illustrate the adaptability of the ILDPC algorithm to this task.
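Micro-F1 can be computed by pooling the true positives, false positives, and false negatives over all classes. The sketch below assumes that each cluster has already been mapped to a class label, a step the text does not describe; the function name `micro_f1` and the toy labels are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label predictions.

    Assumes cluster ids in y_pred were already mapped to class labels.
    """
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in labels:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1  # predicted c and truly c
            elif p == c:
                fp += 1  # predicted c but truly another class
            elif t == c:
                fn += 1  # truly c but predicted another class
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Four of the six nodes are assigned to the correct community.
score = micro_f1([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

For single-label assignments every error counts once as a false positive and once as a false negative, so Micro-F1 coincides with the fraction of correctly assigned nodes.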

The algorithm is compared with different clustering algorithms on the community detection task to verify its adaptability. The training process on the above datasets using different clustering algorithms is shown in

It can be seen that the ILDPC algorithm paired with the community detection task achieved good results. As can be seen from

Decision diagrams are required in the DPC algorithm to select the density peak points, and the proposed ILDPC algorithm is better able to find the data points that are density peaks. In this paper, the output of the last epoch is used to draw the decision diagram. As shown in

The method proposed in this paper can effectively handle large-scale network data and complete the community detection task downstream of the representation learning task.

To further determine the impact of the experimental parameters of the ILDPC algorithm on the community detection task and the validity of the experimental results, this section explores how the choice of truncation distance affects the clustering results on the three datasets Cora, Computer, and PubMed.

The DPC algorithm and the ILDPC algorithm require only one parameter, the truncation distance. However, in clustering tasks on large-scale data sets, the choice of truncation distance affects which cluster centroids are found, and the clustering results are sensitive to it. The experimental results show that the truncation distance is usually chosen between 1% and 2% of the total number of vectors to achieve the best results. The network embedding vectors output by the network embedding model can effectively distinguish the community structure and map the node vectors from a high-dimensional space to a low-dimensional dense space. Good results are achieved in the downstream task of improved density peak clustering. As shown in
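One common way to turn a percentage into a truncation distance, following the heuristic suggested for the original DPC algorithm, is to take the distance at that percentile of the sorted pairwise distances, so that on average roughly that share of points falls inside each neighborhood. Whether the experiments here use exactly this procedure is an assumption; the function name `cutoff_from_percent` is hypothetical.

```python
import math
from itertools import combinations


def cutoff_from_percent(points, percent):
    """Pick the truncation distance as a percentile of pairwise distances.

    percent: value such as 1.0 or 2.0, meaning 1% or 2%.
    """
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    # Position of the requested percentile, clamped to a valid index.
    idx = round(len(dists) * percent / 100.0) - 1
    idx = min(max(idx, 0), len(dists) - 1)
    return dists[idx]


# Eleven evenly spaced points give 55 pairwise distances.
points = [(float(i),) for i in range(11)]
d_c = cutoff_from_percent(points, 2.0)
```

Sweeping `percent` over a small range and re-running the clustering is a simple way to carry out the kind of sensitivity analysis this section describes.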

In this paper, the community detection task is carried out for each of the three datasets by different algorithms. To ensure a fair comparison of the experiments, repeated experiments are conducted by selecting the best parameters, and the best results are taken for the experimental test results. The performance of the proposed ILDPC algorithm combined with network representation learning on the community detection task is verified by comparing different algorithms. The F1-score values of the different algorithms on the dataset are shown in

In most cases, especially when the adjacency matrix is sparse, classical community detection algorithms struggle to perform this task effectively on network data. The framework proposed in this paper adapts effectively to network data and accomplishes the community detection task. In the first stage of the model, the output node vectors are fused with the surrounding information to capture node clustering information and node neighbor information. The low-dimensional node vectors output by this stage therefore preserve the community structure information, which facilitates the mining of clustering information.

In this paper, we propose a community detection algorithm that combines GraphSAGE graph representation learning with an improved density peak clustering algorithm. The algorithm uses GraphSAGE to embed large-scale graph datasets into dense low-dimensional vectors while preserving the original topological information of the graph. On this basis, the ILDPC algorithm is proposed, which adapts to the large-scale dense vectors output by the embedding model and completes the community detection. A comparison of the decision diagrams drawn by the ILDPC algorithm and the DPC algorithm shows that ILDPC selects the density peak points in the data more accurately and adapts better to the embedded vectors. Finally, a comparison with the K-means and DPC algorithms on three real attributed datasets shows, in terms of the F1-score, that the algorithm can effectively combine the attributes of embedded nodes and the local density of nodes to detect community centroids and complete community detection of network data.