In recent years, many network presentation learning algorithms (NPLA) have been built on random walks between nodes. Although these algorithms can obtain good embedding results, they also have limitations. For instance, only the structural information of nodes is considered when such algorithms are constructed. To address this issue, a label and community information-based network presentation learning algorithm (LCNPLA) is proposed in this paper. First, the first-order neighbors of nodes are reconstructed by using the community information and label information of nodes. Next, the random walk strategy is improved by integrating the degree information and label information of nodes. Then, the node sequences obtained from random walk sampling are transformed into node representation vectors by the SkipGram model. Finally, experimental results on ten real-world networks demonstrate that, compared with three benchmark algorithms, the proposed algorithm has clear advantages in the label classification, network reconstruction and link prediction tasks.
In the real world, various complex networks are abstracted from social scenarios, such as social networks [
However, with the development of society, the scale of complex networks has grown dramatically, and the link relationships between nodes have become more and more complicated, so more and more space is needed to store networks. As a result, these traditional complex network analysis methods [
To solve this problem, many scholars have gradually put forward NPLA based on the random walk strategy, which simulate natural language processing models to achieve network embedding, mapping irregular, high-dimensional complex networks into regular low-dimensional vector spaces. In this process, several models have been proposed. For example, the DeepWalk [
Unfortunately, these random walk-based NPLA can only capture the structural information of networks. That is to say, the community information of nodes is not considered in the construction of NPLA. To address this problem, some scholars have subsequently proposed community information-based random walk NPLA. For instance, based on the local structure information and global structure information of network nodes, CARE [
In addition, the label information of nodes is also a very important kind of information in the network. For example, the bipartite graph [
Based on the above analysis and discussion, in this paper we propose a label and community information-based network presentation learning algorithm (LCNPLA) to capture more information. In LCNPLA, the first-order neighbors of nodes are reconstructed by using the community information and label information of nodes. Moreover, the degree information and label information of nodes are fused to construct a random walk strategy. The experimental results on multiple network data sets demonstrate that the proposed algorithm achieves a clear advantage in the label classification, network reconstruction and link prediction tasks.
The contributions of this paper are as follows:
The first-order neighbors of nodes are reconstructed by utilizing the community information and label information of nodes.
A random walk strategy is constructed by integrating the degree information and label information of nodes.
The LCNPLA algorithm proposed in this paper can achieve better performance, compared to other benchmark algorithms.
The rest of this article is organized as follows.
In this section, some necessary background is introduced: network embedding, random walk sequences and the pairwise constraint matrix. For more details, one can refer to the references [
In general, an undirected and unweighted network can be represented as a tuple
The random walk traverses a network starting at a node. At each node, it moves to a neighbor of that node with probability a. A random walk sequence with
Matrix
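The random walk sampling described above can be sketched in a few lines of Python; the adjacency list, walk length, and uniform transition probabilities here are illustrative assumptions, not the paper's exact settings:

```python
import random

def random_walk(adj, start, length, rng):
    """Sample a walk of `length` nodes by repeatedly moving to a
    uniformly chosen neighbor of the current node."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = adj[cur]
        if not neighbors:          # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk

# toy undirected network stored as an adjacency list (hypothetical)
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walk = random_walk(adj, start=0, length=5, rng=random.Random(42))
```

Every consecutive pair in the returned sequence is an existing edge, which is what makes such sequences usable as "sentences" for the SkipGram model later on.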
In order to facilitate understanding of this article, some of the relevant concepts are summarized in
Notations  Descriptions 

The label information vector of 

Weight of label  
Predetermined proportion  
Number of random walks per node  
The degree of 

The set of labels  
The dimension of the node embedding vector  
Node embedding vector matrix  
The learning rate of stochastic gradient descent algorithm 
In this section, we propose LCNPLA and discuss the main components of our algorithm. The specific algorithm flow is divided into the following two parts: the preprocessing of nodes, and the Label and Degree strategy (
In
From the above discussion, we intend to use the Louvain [
During the preprocessing of nodes, the new first-order neighbors of nodes are generated. In order to sample these, we design a
At the structural level, due to the complexity of node information, the similarity between two nodes is asymmetric even in an undirected network. Therefore, considering computational complexity and other factors, KL divergence is selected to measure the structural information of the network: the KL divergence between the degree distributions of any two nodes is used as the structural information of the network. The calculation formula of KL divergence is defined as
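A hedged stdlib sketch of this idea follows; using the neighbor-degree histogram as each node's degree distribution is an illustrative assumption, and a small `eps` smooths empty bins so the logarithm stays finite:

```python
import math
from collections import Counter

def degree_distribution(adj, node):
    """Empirical distribution of the degrees of `node`'s neighbors."""
    degs = [len(adj[v]) for v in adj[node]]
    counts = Counter(degs)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

def kl_divergence(p, q, eps=1e-10):
    """D_KL(p || q) over the union of supports. KL is asymmetric,
    matching the asymmetric node similarity discussed above."""
    support = set(p) | set(q)
    return sum(p.get(d, eps) * math.log(p.get(d, eps) / q.get(d, eps))
               for d in support)

# hypothetical toy network
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
d = kl_divergence(degree_distribution(adj, 0), degree_distribution(adj, 2))
```

Identical distributions give a divergence of zero, and the value grows as the two nodes' neighborhood degree profiles drift apart.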
At the attribute level, because we cannot directly represent the label information of nodes, we first utilize one-hot coding to express the label information of nodes. Then, to represent the attribute information of the network, the Hadamard product is used to calculate the correlation of the label information of any two nodes.
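A minimal sketch of this attribute-level step, assuming single-label nodes for simplicity (the label set and assignments are hypothetical):

```python
def one_hot(label, label_set):
    """One-hot encoding of a node label over the sorted label set."""
    return [1 if l == label else 0 for l in sorted(label_set)]

def hadamard(u, v):
    """Element-wise (Hadamard) product of two label vectors; summing
    it counts the labels the two nodes share."""
    return [a * b for a, b in zip(u, v)]

labels = {"A", "B", "C"}
x = one_hot("A", labels)          # encodes label "A"
y = one_hot("A", labels)
z = one_hot("B", labels)
same = sum(hadamard(x, y))        # 1: the two nodes share a label
diff = sum(hadamard(x, z))        # 0: no shared label
```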
In this paper, the weight of a label refers to the proportion of the total number of labels, over all nodes, that corresponds to that type of label. The specific calculation method is as follows:
where
The calculation formula is
where the calculation formula of the
where
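One possible reading of this weighting, sketched with the standard library (the toy label assignment is hypothetical):

```python
from collections import Counter

def label_weights(node_labels):
    """Weight of each label: its share of the total number of label
    occurrences across all nodes."""
    counts = Counter(node_labels.values())
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

# hypothetical node -> label assignment
node_labels = {0: "A", 1: "A", 2: "B", 3: "A"}
w = label_weights(node_labels)    # {"A": 0.75, "B": 0.25}
```

By construction the weights sum to one, so they can be used directly as probabilities when biasing the walk toward frequent or rare labels.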
The pseudocode of the
The SkipGram model is a language model. Given a predefined window
where for
Under the Conditional independence hypothesis and Symmetry in feature space hypothesis,
To simplify the optimization problem, we make two standard assumptions:
Conditional independence: by assuming that the likelihood of observing a neighborhood node is independent of observing any other neighborhood node, we factorize the likelihood. The mathematical expression of the vector representation of the source is
Symmetry in feature space: a source node and its neighborhood node have a symmetric effect in feature space. Accordingly, we model the conditional likelihood of every source-neighborhood node pair as a softmax unit parametrized by the dot product of their features. The mathematical expression is
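This softmax conditional likelihood can be sketched directly; the toy two-dimensional embeddings `f` are hypothetical:

```python
import math

def softmax_likelihood(f, u, n):
    """Pr(n | u) = exp(f_n . f_u) / sum_v exp(f_v . f_u): the
    conditional likelihood of neighborhood node n given source u,
    a softmax over dot products of feature vectors."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = {v: math.exp(dot(fv, f[u])) for v, fv in f.items()}
    return scores[n] / sum(scores.values())

# hypothetical embeddings for three nodes
f = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [-1.0, 0.0]}
p = softmax_likelihood(f, u=0, n=1)
```

Nodes whose vectors point in similar directions receive higher conditional likelihood, which is exactly the pressure that pulls co-occurring walk neighbors together in feature space.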
With the above assumptions, the objective in
where
As we can see from the second term of
where
With
The set of node sequences is entered into the SkipGram model. After training, we obtain the node embedding vectors. The pseudocode of LCNPLA is shown in Algorithm 2:
In this section, we introduce some experimental materials, which are the experiment datasets, evaluation criteria and benchmark algorithms. In this paper, the experimental environment is listed in
Parameter  Parameter value 

RAM  62 GB 
Programming  Python 
CPU  13th Gen Intel(R) Core(TM) i9-13900K 
System  Ubuntu 20.04 
To verify the effectiveness of LCNPLA, experiments are conducted on the following 10 real network data sets, including Polbooks, Adjoun, Football, Europe, USA, Polblogs, Wiki, Cora, Citeseer and PPI. The information of these data sets is listed in
Dataset 
Polbooks  105  441  8  0.4875  4  Label classification 
USA  1190  13599  23  0.6090  4  
Adjoun  112  425  8  0.2000  2  Network reconstruction 
Europe  399  5995  30  0.5670  4  
Football  115  613  6  0.3708  12  Link prediction task 
Polblogs  1224  16718  27  0.3610  2  
Wiki  2405  16523  11  0.4800  17  
Cora  2708  5429  4  0.2461  7  Label classification and network reconstruction 
Citeseer  3312  4732  3  0.2590  6  
PPI  3890  76584  19  0.1660  50 
Notes:
Here, the evaluation criteria of MicroF1, MacroF1 and WeightedF1 are used for the label classification task; the MAP evaluation criterion is used for the network reconstruction task; the AUC evaluation criterion is used for the link prediction task.
where
is the recall rate for all labels.
where
is
where
where
is the accuracy of
where
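A stdlib sketch of the MicroF1 and MacroF1 criteria for single-label predictions (the toy labels are hypothetical; multi-label averaging, as needed on PPI, would instead pool per-label binary indicators):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """MicroF1 pools TP/FP/FN over all labels; MacroF1 averages
    the per-label F1 scores."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    def f1(tp_, fp_, fn_):
        return 2 * tp_ / (2 * tp_ + fp_ + fn_) if tp_ else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

micro, macro = f1_scores(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

MicroF1 is dominated by frequent labels, while MacroF1 weights every label equally, which is why both are reported for imbalanced data sets.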
Here, we introduce three network presentation learning algorithms used for comparison with our proposed algorithm. Detailed descriptions of them can be found in references [
The main mechanism of the Line algorithm can be divided into two stages to learn the
The main mechanism of the Sdne algorithm can be divided into two stages to learn the feature representation of nodes. In the first step, it learns the local structure of the network by a deep autoencoder. In the second step, it learns the global structure of the network by Laplacian mapping.
The basic idea of the Struc2vec algorithm is that if two nodes have more similar structures in a network, they should have a higher similarity. Its mechanism can be divided into four steps. In the first step, the structural similarity between each pair of nodes is learned. In the second step, a weighted multilayer network is constructed. In the third step, the multilayer network is used to generate a sequence of nodes for each node. In the fourth step, the feature representation of nodes is learned by the SkipGram model.
In this section, we focus on testing the effectiveness of the proposed LCNPLA for label classification, network reconstruction, and link prediction task. Detailed descriptions of benchmark algorithms can be found in references [
We use the one-vs-rest logistic regression applied in Loglinear [
Labeled nodes%  10%  30%  50%  70%  90%  

LCNPLA_P20  71.62  
LCNPLA_P10  76.84  75.47  81.82  
Line  49.47  39.19  49.06  53.13  18.18  
Sdne  69.47  66.22  66.04  68.75  81.82  
Struc2vec  46.81  51.35  61.54  75.00  50.00  
LCNPLA_P20  56.53  
LCNPLA_P10  59.32  54.98  61.25  54.44  
Line  28.48  27.81  42.52  52.13  11.11  
Sdne  48.94  47.60  48.20  49.47  54.44  
Struc2vec  46.81  51.35  61.54  75.00  50.00  
LCNPLA_P20  76.81  
LCNPLA_P10  73.30  76.13  81.52  
Line  38.48  36.51  48.28  52.94  15.15  
Sdne  64.27  62.53  65.43  67.29  81.52  
Struc2vec  37.97  48.41  58.04  70.71  51.11 
Labeled nodes%  10%  30%  50%  70%  90%  

LCNPLA_P20  24.37  26.77  26.89  
LCNPLA_P10  25.77  25.57  25.21  26.33  30.25  
Line  25.21  27.85  27.23  25.77  26.05  
Sdne  24.97  24.20  26.05  29.41  
Struc2vec  24.63  25.14  31.67  
LCNPLA_P20  24.27  26.29  26.57  
LCNPLA_P10  25.03  24.83  26.18  29.99  
Line  24.93  27.84  27.19  25.78  26.36  
Sdne  20.73  21.25  20.22  22.00  28.08  
Struc2vec  24.48  25.05  31.75  
LCNPLA_P20  24.26  26.32  26.50  
LCNPLA_P10  25.01  24.64  26.07  28.18  
Line  24.90  27.84  27.13  25.66  26.05  
Sdne  20.60  21.14  20.05  21.68  27.71  
Struc2vec  24.50  24.93  31.64 
Labeled nodes%  10%  30%  50%  70%  90%  

LCNPLA_P20  49.30  54.43  
LCNPLA_P10  54.06  55.10  53.14  
Line  22.19  22.84  24.67  23.49  26.20  
Sdne  17.14  17.46  16.63  17.59  14.94  
Struc2vec  30.11  33.12  35.30  37.10  44.12  
LCNPLA_P20  
LCNPLA_P10  45.27  50.85  50.92  51.14  48.09  
Line  22.19  22.84  24.67  23.49  26.20  
Sdne  3.37  3.58  3.88  4.34  3.84  
Struc2vec  18.24  23.97  26.03  26.99  30.92  
LCNPLA_P20  48.30  53.57  
LCNPLA_P10  53.13  53.90  52.09  
Line  19.19  20.11  21.87  19.65  21.88  
Sdne  8.86  9.27  9.35  9.98  8.44  
Struc2vec  25.60  30.05  32.22  33.94  40.68 
Labeled nodes%  10%  30%  50%  70%  90%  

LCNPLA_P20  39.92  41.27  
LCNPLA_P10  34.32  37.73  41.85  
Line  20.16  19.84  18.78  20.22  24.40  
Sdne  26.94  28.29  30.37  31.29  30.12  
Struc2vec  26.63  26.64  29.47  31.99  35.54  
LCNPLA_P20  35.28  
LCNPLA_P10  31.47  34.40  36.17  38.01  
Line  17.77  15.96  15.25  15.97  18.81  
Sdne  17.76  18.81  21.50  22.22  20.41  
Struc2vec  22.85  23.59  25.47  27.08  29.29  
LCNPLA_P20  39.25  
LCNPLA_P10  33.89  36.93  39.18  40.81  
Line  19.50  18.35  17.50  18.59  21.98  
Sdne  20.93  21.87  24.79  25.81  23.65  
Struc2vec  25.26  26.34  29.13  31.21  35.44 
Labeled nodes%  10%  30%  50%  70%  90%  

LCNPLA_P20  7.48  8.47  8.71  
LCNPLA_P10  7.31  8.48  10.25  
Line  7.80  7.99  7.98  8.80  
Sdne  
Struc2vec  7.41  8.01  8.44  8.61  9.24  
LCNPLA_P20  5.70  6.20  6.35  
LCNPLA_P10  5.51  6.08  6.82  
Line  5.98  6.10  6.14  6.76  
Sdne  
Struc2vec  5.34  5.96  6.23  6.16  6.89  
LCNPLA_P20  6.64  7.60  8.11  
LCNPLA_P10  6.57  7.57  8.36  
Line  7.25  7.53  7.74  7.81  
Sdne  
Struc2vec  6.77  7.52  8.02  8.11  8.86 
The experimental results on Polbooks are shown in
However, as the ratio of community information increases, noise may affect the accuracy of the algorithm. As a consequence, the MicroF1, MacroF1, and WeightedF1 scores of LCNPLA_P20 may be lower than those of LCNPLA_P10. Based on the above analysis, and compared with the experimental results on USA, Cora and Citeseer, we find that LCNPLA can also obtain better results on small data sets.
The experimental results on USA are shown in
Specifically, although USA has the largest number of edges among Polbooks, USA, Cora and Citeseer, LCNPLA can still enhance the label classification ability when the training ratio is 70%: the MicroF1, MacroF1 and WeightedF1 scores of the best LCNPLA are 1.12%, 0.87% and 0.86% higher than those of the best benchmark algorithm, respectively. When the training ratio is 90%, the MicroF1, MacroF1 and WeightedF1 scores of the best LCNPLA are 1.10%, 0.81% and 0.46% higher than those of the best benchmark algorithm, respectively.
Based on the above analysis, although the effect of LCNPLA is affected by the training ratio, the advantages of LCNPLA gradually become more obvious as the training ratio increases. As the training ratio increases, LCNPLA shows clear advantages on data sets with a large number of edges.
The experimental results on Cora and Citeseer are shown in
The experimental result on PPI is shown in
We rebuild the proximities of the nodes, sort node pairs by proximity, and take the proportion of true links among the top-k predictions as the reconstruction accuracy. For a network with a large number of nodes, the number of possible node pairs
Here we can see that the performance of LCNPLA is highly data-set dependent: it achieves good performance on Polblogs but performs poorly on the other data sets. Consequently, the reconstruction performance of the Sdne algorithm is better than that of the best LCNPLA. The reason may be that the Sdne algorithm collects both local structure information and global structure information. In the
In the
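The reconstruction accuracy described above (the proportion of true links among the top-k highest-proximity node pairs) can be sketched as follows; the proximity scores and edge set are hypothetical:

```python
def precision_at_k(scores, true_edges, k):
    """Sort candidate node pairs by predicted proximity (descending)
    and report the fraction of the top-k pairs that are real links."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for pair in ranked[:k] if pair in true_edges)
    return hits / k

# hypothetical proximities for four node pairs
scores = {(0, 1): 0.9, (0, 2): 0.7, (1, 2): 0.4, (2, 3): 0.1}
true_edges = {(0, 1), (2, 3)}
p_at_2 = precision_at_k(scores, true_edges, k=2)
```

For large networks, only a sampled subset of the quadratically many node pairs would be scored, as the text notes.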
To ensure the accuracy of LCNPLA, we select hidden ratios of 15%, 30% and 45% to divide the training set, validation set and test set. Note that the amount by which the AUC exceeds 0.50 measures how much better the algorithm is than random selection. We compare against embedding-based methods bootstrapped using binary operators: Hadamard, WeightedL1 and WeightedL2 (see
Operator  Symbol  Definition 

Hadamard  
WeightedL1  
WeightedL2 
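The three binary operators can be sketched as element-wise functions over a pair of node embedding vectors (the vectors here are hypothetical):

```python
def hadamard(u, v):
    """Element-wise product of two embedding vectors."""
    return [a * b for a, b in zip(u, v)]

def weighted_l1(u, v):
    """Element-wise absolute difference."""
    return [abs(a - b) for a, b in zip(u, v)]

def weighted_l2(u, v):
    """Element-wise squared difference."""
    return [(a - b) ** 2 for a, b in zip(u, v)]

u, v = [1.0, 2.0], [3.0, 0.5]
edge_feats = {
    "Hadamard":   hadamard(u, v),     # [3.0, 1.0]
    "WeightedL1": weighted_l1(u, v),  # [2.0, 1.5]
    "WeightedL2": weighted_l2(u, v),  # [4.0, 2.25]
}
```

Each operator turns a pair of node embeddings into one edge feature vector of the same dimension, which is then fed to the link prediction classifier.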
Operator  Algorithm  Football  Polblogs  Wiki 

LCNPLA_P20  0.5837  0.5736  
LCNPLA_P10  0.5746  0.7202  0.5805  
Hadamard  Line  0.4725  0.5843  0.5539 
Sdne  0.6336  
Struc2vec  0.4389  0.6304  0.5528  
LCNPLA_P20  0.5938  0.7116  0.6066  
LCNPLA_P10  0.6061  0.5980  
WeightedL1  Line  0.5751  0.6513  0.5813 
Sdne  0.6488  
Struc2vec  0.5860  0.6283  0.4308  
LCNPLA_P20  0.5893  0.7050  0.6286  
LCNPLA_P10  0.6026  0.6383  
WeightedL2  Line  0.5719  0.6436  0.5609 
Sdne  0.6773  
Struc2vec  0.5736  0.6179  0.4426 
From
For Football, the link prediction ability of the Sdne algorithm is better than that of LCNPLA; the reason may be that it can capture higher-order structures in the network. For the Hadamard operator, neither Line nor Struc2vec has predictive ability, so under this operator the best LCNPLA is not comparable with the Line and Struc2vec algorithms. For the WeightedL1 operator, compared with the Line and Struc2vec algorithms, the best LCNPLA gains 3.10% and 2.01%, respectively. For the WeightedL2 operator, gains of 3.07% and 2.90% are obtained, respectively.
For Wiki, the best LCNPLA gains 2.66% and 2.77% under the Hadamard operator, respectively. For the WeightedL1 operator, compared to Line and Struc2vec, the best LCNPLA gains 2.53%. For the WeightedL2 operator, a gain of 7.74% is obtained. Experimental results show that LCNPLA has better link prediction ability on Polblogs than on Wiki. The reason may be that LCNPLA has a stronger prediction advantage on network data sets with fewer nodes when the numbers of edges are similar.
All in all, combining experimental results from Polblogs, Football and Wiki, it is found that the link prediction ability of LCNPLA is not obvious in
The
To further improve the application of node representation vectors in downstream tasks, this paper has designed a label and community information-based network presentation learning algorithm. By integrating the label information of nodes and the community information of the network, the LD strategy is defined to expand the first-order neighbors of nodes. In this process, the node sequences are generated by the LD strategy, and then the node representation vectors are generated by the SkipGram model. On 9 real networks, the performance of the proposed algorithm was compared with the other benchmark algorithms with the help of 5 evaluation metrics. Extensive theoretical derivation and experimental analysis demonstrated that the proposed LCNPLA is more advantageous in the label classification, network reconstruction and link prediction tasks.
We are hugely grateful to the anonymous reviewers for their constructive comments on the original manuscript. In addition, we thank the National Natural Science Foundation of China (Nos. 61966039, 62241604) and the Scientific Research Fund Project of the Education Department of Yunnan Province (No. 2023Y0565). This work was also supported in part by the Xingdian Talent Support Program for Young Talents (No. XDYCQNRC20220518).
S.L.: Responsible for proposing algorithm ideas, analyzing experimental data as well as writing the paper. C.Y.: Responsible for proposing guidance and revising the final version of the paper. Y.L.: Responsible for collecting network data.
All data generated or analyzed during this study are included in this published article.
The authors declare that they have no conflicts of interest to report regarding the present study.