In clustering algorithms, the selection of neighbors significantly affects the quality of the final clustering results. While various neighbor relationships exist, such as

Clustering is one of the most important methods of data mining [

Partition-based clustering algorithms are a class of highly efficient clustering algorithms. They use an iterative updating strategy to optimize the partition of the data until the optimal result is reached. Among the representative ones is the

To effectively discover arbitrary-shaped clusters, density-based clustering algorithms are proposed [

Hierarchical clustering seeks to build a hierarchy of clusters [

Datasets with complex structures generally refer to those that contain clusters with diverse shapes (including spherical, non-spherical, and manifold), sizes, and densities. Traditional clustering algorithms (e.g., DBSCAN,

To show that the proposed neighbor method can effectively identify datasets of any structure, we propose the HC-SNaN algorithm. Inspired by natural neighbors and shared neighbors, we propose a new neighbor method, the shared natural neighbor method. This method skillfully combines the advantages of natural neighbors and shared neighbors and more effectively reflects the true distribution of data points in complex structures. The algorithm first constructs a shared natural neighbor graph using shared natural neighbors and then cuts it into multiple sub-graphs. Next, we merge the fuzzy points with the cut sub-graphs to form initial sub-clusters; here, fuzzy points are points that remain unconnected to any sub-graph after the cutting. Finally, the initial clusters are merged based on our proposed merging strategy. To show the process of our algorithm more clearly, the flow chart of the proposed HC-SNaN algorithm is shown in

The experimental results on synthetic datasets and real-world datasets show that our algorithm is more effective and efficient than existing methods when processing datasets with arbitrary shapes. Overall, the major contributions of this paper are:

We propose a new neighbor method. Combining the characteristics of natural neighbors and shared neighbors, it shows excellent performance when dealing with variable-density and noisy datasets.

We propose a new method to generate sub-clusters based on shared natural neighbors. This method first uses the shared natural neighbor graph to divide the dataset into multiple sub-clusters and then applies a new merging strategy to merge pairs of sub-clusters. Thus, satisfactory clustering results can be obtained on multiple datasets.

Experimental results on synthetic and real-world datasets show that HC-SNaN has clear advantages in detecting arbitrary-shaped clusters over other state-of-the-art algorithms.

In

In contrast, the natural neighbor method automatically adapts to the dataset's distribution, eliminating the need for manual

Definition 1 (Stable searching state). A stable search state is reached if and only if the natural neighbor search process satisfies the following conditions:

The parameter

Definition 2 (Natural neighbor). When the natural neighbor search is in a stable state, data points in each other's immediate neighborhoods are considered natural neighbors of each other. Assuming that the search state stabilizes after the

The
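The stable search procedure described above can be sketched as follows. This is a minimal illustration of the usual natural neighbor search, with our own function name, plain Euclidean distances, and a simple stopping rule (the count of points with no reverse neighbor stops changing, or reaches zero); it is not the authors' exact implementation:

```python
import numpy as np

def natural_neighbor_search(X):
    """Expand the neighborhood size k until the search stabilizes.

    A point q is a natural neighbor of p when each is among the
    other's k nearest neighbors.  The search stops when the number
    of points without any reverse neighbor no longer changes.
    """
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)        # order[i, 0] is i itself
    nan_edges = set()                    # mutual-neighbor pairs
    reverse_count = np.zeros(n, dtype=int)
    prev_orphans = -1
    k = 0
    while True:
        k += 1
        if k >= n:
            break
        for i in range(n):
            j = order[i, k]              # i's k-th nearest neighbor
            reverse_count[j] += 1
            # mutual: i must also be among j's first k neighbors
            if i in {order[j, m] for m in range(1, k + 1)}:
                nan_edges.add(frozenset((i, j)))
        orphans = int(np.sum(reverse_count == 0))
        if orphans == prev_orphans or orphans == 0:
            break
        prev_orphans = orphans
    return k, nan_edges
```

On two well-separated groups, the search stabilizes at a small k and produces edges only within each group.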

A sample and its neighboring data points usually belong to the same cluster category, allowing for a more accurate assessment of the sample’s local density based on its neighboring information. Specifically, the shared region encompassing a sample and its neighbors provides rich local information, aiding in a more precise depiction of the sample’s distribution.

Incorrectly determining similarity can result in misclassifications. The shared region among samples plays a pivotal role in determining sample similarity and local density. Therefore, establishing sample similarity and local density based on the relationships between a sample and its neighbors is crucial for improving clustering accuracy.

Definition 3 (Shared neighbor). For any points

For a sample

Identifying arbitrary datasets requires strong recognition ability for complex structures. A good way to identify datasets with complex structures is to divide them into multiple simple structures. Inspired by this, the HC-SNaN algorithm has three steps: the first step constructs a neighborhood graph representing the structure of the dataset and then partitions it into several sub-graphs and fuzzy points; the second step assigns fuzzy points to sub-graphs according to rules; the third step merges sub-graphs to obtain the final clustering result. In this section, we describe the algorithm in detail.

Shared natural neighbors have better adaptability for identifying complex datasets. In comparison to the

The shared neighbor method establishes similarity between data points based on their shared

Inspired by the advantages of the above two neighbor methods, we propose a new neighbor method, the shared natural neighbor. This method is not influenced by neighbor parameters and effectively prevents over-aggregation among high-density points. It demonstrates flexibility in adapting to datasets with different density distributions, making it valuable for discovering clusters with arbitrary structures. Its specific implementation is shown in
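Assuming the natural neighbor set of every point has already been obtained, the shared natural neighbor of two points reduces to a set intersection. A minimal sketch (`nan_sets` is our own name for the precomputed natural neighbor sets):

```python
def shared_natural_neighbors(nan_sets, p, q):
    """Shared natural neighbors of p and q: the points that are
    natural neighbors of both; their count is used as the
    similarity between p and q."""
    return nan_sets[p] & nan_sets[q]
```

Because the natural neighbor sets are parameter-free, the resulting similarity inherits that property while still penalizing pairs that sit in different local regions.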

Definition 4 (Shared natural neighbor). For any two natural neighbor points

The number of shared natural neighbors for the pair

The

Definition 5 (Shared natural neighbor similarity). For any two natural neighbor points

Definition 6 (Sub-cluster core point). For natural neighbors

Definition 7 (Local density of natural neighbor). The local density of the natural neighbor is denoted as:

Definition 8 (Cluster crossover degree). For

Definition 9 (Shortest distance between sub-clusters). All distances between points in sub-clusters

Definition 10 (Natural neighbors shared between sub-clusters). The natural neighbors shared between sub-clusters are represented as:

In the segmentation stage of HC-SNaN, we first construct the shared natural neighbor graph
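This segmentation step can be roughly sketched as follows, under simplifying assumptions of ours: edges carry shared-natural-neighbor weights, edges below a threshold are cut, surviving connected components become sub-graphs, and singleton components are treated as fuzzy points (the exact cutting rule in HC-SNaN is the authors' own):

```python
from collections import defaultdict

def cut_graph(n, weighted_edges, threshold):
    """Keep only edges whose weight reaches the threshold, then read
    off connected components via depth-first search; components of
    size 1 are treated as fuzzy points."""
    adj = defaultdict(set)
    for (i, j), w in weighted_edges.items():
        if w >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    seen, comps = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], set()
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    subgraphs = [c for c in comps if len(c) > 1]
    fuzzy = sorted(x for c in comps if len(c) == 1 for x in c)
    return subgraphs, fuzzy
```

Raising the threshold cuts more edges, producing more (and smaller) sub-graphs plus additional fuzzy points.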

Moving to the merging phase, we calculate the distance of each fuzzy point from the sub-cluster core points, as well as the average distance. Next, we determine the clusters to which the natural neighbors of the fuzzy point belong. Finally, the fuzzy point is assigned to the cluster whose core point is closest, provided this distance is smaller than the average distance to the core points of the other sub-clusters. Fuzzy points that cannot be assigned are designated as local outliers. Detailed steps are described in Algorithm 2.
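A simplified sketch of this assignment rule follows; the names and the outlier label −1 are ours, and the full Algorithm 2 additionally consults the clusters of the fuzzy point's natural neighbors:

```python
import numpy as np

def assign_fuzzy_points(X, fuzzy, cores, labels):
    """Assign each fuzzy point to the sub-cluster with the nearest
    core point, but only when that distance is below the average
    distance to all core points; otherwise mark it as a local
    outlier (label -1).  `cores` maps cluster id -> core point index."""
    labels = labels.copy()
    for p in fuzzy:
        dists = {c: np.linalg.norm(X[p] - X[core])
                 for c, core in cores.items()}
        nearest = min(dists, key=dists.get)
        if dists[nearest] < np.mean(list(dists.values())):
            labels[p] = nearest
        else:
            labels[p] = -1  # local outlier
    return labels
```

A fuzzy point roughly equidistant from all core points fails the average-distance test and is kept as a local outlier rather than forced into a cluster.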

Upon completing the assignment of fuzzy points, an initial cluster is formed. Subsequently, we calculate the cluster crossover degree, the shortest distance between sub-clusters, and the natural neighbors shared between sub-clusters. The merging process continues until the desired number of clusters is achieved. Finally, we assign the

The main improvements of the HC-SNaN algorithm are as follows: it constructs a sparse graph based on shared natural neighbors, which not only sparsifies the data but also improves the algorithm's adaptability to different datasets. In the merging phase, the algorithm considers the interconnection and compactness within the data. This better captures the features inside each cluster, which is highly effective for processing datasets whose clusters have different shapes and densities.
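Under these considerations, the merge loop can be sketched as follows, here reduced to the shortest-distance criterion of Definition 9 alone; the full HC-SNaN criterion also weighs the cluster crossover degree and the shared natural neighbors between sub-clusters (Definitions 8 and 10), so this is an illustrative simplification, not the authors' exact rule:

```python
import numpy as np

def merge_subclusters(X, labels, target_k):
    """Greedily merge sub-clusters until target_k remain, always fusing
    the pair with the smallest point-to-point (shortest) distance.
    Label -1 marks outliers and is left untouched."""
    labels = labels.copy()
    def cluster_ids():
        return sorted(set(labels[labels >= 0]))
    while len(cluster_ids()) > target_k:
        best = None
        ids = cluster_ids()
        for ai in range(len(ids)):
            for bi in range(ai + 1, len(ids)):
                a, b = ids[ai], ids[bi]
                # shortest distance between the two sub-clusters
                d = np.min(np.linalg.norm(
                    X[labels == a][:, None, :] - X[labels == b][None, :, :],
                    axis=-1))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        labels[labels == b] = a  # fuse b into a
    return labels
```

Starting from singleton sub-clusters, nearby points coalesce first, and the loop stops as soon as the requested number of clusters is reached.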

HC-SNaN mainly includes three steps: The first step is to divide the sub-graph. First, we search the natural neighbor

To illustrate the effectiveness of the HC-SNaN algorithm, we conducted experiments on synthetic datasets and real-world datasets. These datasets have different scales, different dimensions, and different manifold structures. The information on these datasets is shown in

Dataset | Samples | Attributes | Clusters | Dataset | Samples | Attributes | Clusters |
---|---|---|---|---|---|---|---|
Pathbased | 299 | 2 | 3 | Wine | 178 | 13 | 3 |
Jain | 372 | 2 | 2 | Haberman | 306 | 3 | 2 |
Compound | 403 | 2 | 3 | Ecoli | 336 | 7 | 8 |
Dadom | 499 | 2 | 6 | Dermatology | 358 | 34 | 6 |
Aggregation | 787 | 2 | 7 | BCW | 683 | 10 | 2 |
T2-4T | 4197 | 2 | 6 | Mnist | 810 | 500 | 7 |
Atom | 499 | 3 | 2 | Contraceptive | 1473 | 9 | 3 |
Chainlink | 499 | 3 | 2 | Page-blocks | 5473 | 10 | 5 |

To further assess the superiority of our proposed HC-SNaN algorithm, this section will present a comparison of our algorithm with seven others. These include recent algorithms: GB-DP algorithm [

Algorithm | AMI | ARI | FMI | Arg. | AMI | ARI | FMI | Arg. |
---|---|---|---|---|---|---|---|---|
 | Pathbased |||| Jain ||||
DPC | 0.5035 | 0.4196 | 0.6487 | 3.8 | 0.6365 | 0.6965 | 0.8739 | 0.9 |
GB-DP | 0.4934 | 0.4338 | 0.6597 | 3 | 0.2193 | 0.0591 | 0.5947 | 2 |
NDP-Kmeans | 0.5173 | 0.3092 | 0.3759 | 3/0 | | | | 2/0 |
HC-LCCV | 0.4699 | 0.2685 | 0.3730 | ~ | 0.5061 | 0.3595 | 0.4698 | ~ |
 | 0.5452 | 0.4642 | 0.6629 | 3 | 0.3677 | 0.3241 | 0.7005 | 2 |
DBSCAN | 0.6652 | 0.4573 | 0.7279 | 1.81/17 | 0.9713 | 0.9887 | 0.9956 | 2.6/16 |
SNNDPC | 0.9005 | 0.9292 | 0.9528 | 9/3 | 0.4354 | 0.4060 | 0.7397 | 12/2 |
HC-SNaN | | | | 1/3 | | | | 0/2 |
 | Compound |||| Dadom ||||
DPC | 0.8383 | 0.6450 | 0.7288 | 9/6 | 0.6287 | 0.3710 | 0.5208 | 2.62/6 |
GB-DP | 0.8003 | 0.7489 | 0.8222 | 6 | 0.7081 | 0.5704 | 0.6623 | 6 |
NDP-Kmeans | 0.8211 | 0.4903 | 0.3464 | 0.7/6 | 0.7766 | 0.4290 | 0.6906 | 0.1/6 |
HC-LCCV | 0.9145 | 0.7470 | 0.4422 | ~ | 0.9057 | 0.7498 | 0.8677 | ~ |
 | 0.7381 | 0.5864 | 0.6798 | 6 | 0.7320 | 0.6305 | 0.7021 | 6 |
DBSCAN | 0.9039 | 0.8220 | 0.9268 | 1.5/13 | 0.5729 | 0.5380 | 0.8017 | 0.21/2 |
SNNDPC | 0.8471 | 0.8372 | 0.8764 | 16/6 | 0.8477 | 0.7354 | 0.7854 | 2/6 |
HC-SNaN | | | | 1/6 | | | | 0/6 |
 | Aggregation |||| T2-4T ||||
DPC | 0.9341 | 0.9199 | 0.9380 | 0.9 | 0.6933 | 0.4690 | 0.5945 | 3.59 |
GB-DP | 0.6557 | 0.5362 | 0.6359 | 7 | 0.4470 | 0.2876 | 0.4473 | 6 |
NDP-Kmeans | 0.9245 | 0.7778 | 0.8950 | 7/0.3 | 0.8982 | 0.7589 | 0.8965 | 0.1/6 |
HC-LCCV | 0.8894 | 0.6792 | 0.4026 | ~ | 0.0398 | 0.0060 | 0.4936 | ~ |
 | 0.8391 | 0.7112 | 0.7722 | 7 | 0.6547 | 0.4835 | 0.5952 | 6 |
DBSCAN | 0.8166 | 0.6052 | 0.8062 | 0.951/4 | 0.8450 | 0.6613 | 0.8608 | 0.07/5 |
SNNDPC | 0.9548 | 0.9593 | 0.9681 | 15/7 | 0.7064 | 0.5073 | 0.6244 | 16/6 |
HC-SNaN | | | | 2/7 | | | | 8/6 |
 | Atom |||| Chainlink ||||
DPC | 0.1325 | 0.0318 | 0.6581 | 1.8/2 | 0.3114 | 0.2068 | 0.6593 | 2.29/2 |
GB-DP | 0.2655 | 0.1495 | 0.6492 | 2 | 0.1254 | 0.1310 | 0.5979 | 2 |
NDP-Kmeans | 0.0186 | 0.0002 | 0.6995 | 2/0.1 | | | | 2/0.1 |
HC-LCCV | | | | ~ | | | | ~ |
 | 0.2964 | 0.1880 | 0.6559 | 2 | 0.0950 | 0.1269 | 0.5645 | 2 |
DBSCAN | 0.9913 | 0.9653 | 0.9362 | 3.5/9 | 0 | 0 | 0.7064 | 0.51/1 |
SNNDPC | | | | 4/2 | 0.2891 | 0.1777 | 0.6531 | 3/2 |
HC-SNaN | | | | 0/2 | | | | 0/2 |

Algorithm | AMI | ARI | FMI | Arg. | AMI | ARI | FMI | Arg. |
---|---|---|---|---|---|---|---|---|
 | Wine |||| Haberman ||||
DPC | 0.3965 | 0.2926 | 0.6192 | 0.6 | 0.0082 | 0.0115 | 0.7805 | 1.3 |
GB-DP | 0.3318 | 0.2235 | 0.5832 | 3 | −0.0039 | 0.0072 | 0.7769 | 2 |
NDP-Kmeans | 0.4080 | 0.1671 | 0.3458 | 0.1/3 | 0.0034 | −0.0021 | 0.7771 | 0/2 |
HC-LCCV | 0.3925 | 0.2030 | 0.2942 | ~ | 0.0130 | 0.0315 | 0.7390 | ~ |
 | 0.4288 | 0.2795 | 0.3392 | 3 | −0.0018 | −0.0039 | 0.5508 | 2 |
DBSCAN | 0.0000 | 0.0000 | 0.5813 | 0.5/2 | 0.0000 | 0.0000 | 0.0000 | 1.97/15 |
SNNDPC | 0.8763 | 0.8983 | 0.9324 | 18/3 | 0.0012 | −0.0114 | 0.5596 | 7/2 |
HC-SNaN | | | | 0/3 | | | | 4/2 |
 | Ecoli |||| Dermatology ||||
DPC | 0.4477 | 0.3750 | 0.6483 | 0.4 | 0.5947 | 0.3591 | 0.5665 | 0.4 |
GB-DP | 0.5482 | 0.3661 | 0.5523 | 8 | 0.0165 | 0.0039 | 0.1991 | 6 |
NDP-Kmeans | 0.0300 | 0.0024 | 0.5783 | 0.1/8 | 0.3798 | 0.0882 | 0.1845 | 0.1/6 |
HC-LCCV | 0.4546 | 0.2430 | 0.3389 | ~ | 0.2574 | 0.0472 | 0.2080 | ~ |
 | 0.5889 | 0.3006 | 0.2760 | 8 | 0.1846 | 0.0465 | 0.1104 | 6 |
DBSCAN | 0.0761 | 0.0458 | 0.4140 | 0.8/4 | 0.5965 | 0.4147 | 0.5380 | 0.99/3 |
SNNDPC | 0.5536 | | 0.7066 | 7/8 | | 0.7332 | 0.7876 | 9/6 |
HC-SNaN | | 0.5544 | | 7/8 | 0.8420 | | | 6/6 |
 | BCW |||| Mnist ||||
DPC | 0.6971 | 0.8029 | 0.9118 | 0.2 | 0.8969 | 0.8636 | 0.9092 | 3.59 |
GB-DP | 0.0009 | 0.0027 | 0.7346 | 2 | 0.0407 | −0.0253 | 0.2874 | 6 |
NDP-Kmeans | 0.0047 | 0.0170 | 0.4325 | 0.2/2 | 0.0184 | −0.0006 | 0.5878 | 0.1/6 |
HC-LCCV | 0.0168 | 0.0059 | 0.3032 | ~ | 0 | 0 | 0 | ~ |
 | 0.0047 | 0.0170 | 0.4325 | 2 | 0.8722 | 0.8965 | 0.8965 | 6 |
DBSCAN | 0.7630 | 0.8523 | 0.9318 | 0.6/8 | −0.0743 | 0.0025 | 0.4043 | 0.07/5 |
SNNDPC | 0.7910 | 0.8579 | 0.9340 | 10/2 | 0.9233 | 0.8755 | 0.9167 | 16/6 |
HC-SNaN | | | | 9/2 | | | | 8/6 |
 | Contraceptive |||| Page-blocks ||||
DPC | 0.0103 | 0.0026 | 0.4336 | 0.3 | 0.0308 | 0.0273 | | 2.29/2 |
GB-DP | | | 0.4780 | 3 | 0.0015 | 0.0076 | 0.8951 | 2 |
NDP-Kmeans | 0.0150 | 0.0016 | 0.5847 | 0.2/3 | 0.0102 | 0.0044 | 0.8104 | 2/0.1 |
HC-LCCV | 0.0293 | 0.0055 | 0.2979 | ~ | 0.0336 | 0.047 | 0.7867 | ~ |
 | 0.0293 | 0.0170 | 0.3644 | 3 | 0.0487 | −0.0105 | 0.6505 | 2 |
DBSCAN | −0.0017 | 0.0006 | 0.5908 | 1.3/4 | 0.0582 | 0.0304 | 0.8126 | 0.51/1 |
SNNDPC | 0.0093 | 0.0012 | 0.4414 | 21/3 | 0.1272 | 0.053 | 0.6348 | 3/2 |
HC-SNaN | −0.0010 | 0.0003 | | 2/3 | | | 0.9002 | 0/2 |

For parameter settings,

In the experiment, we used AMI [

Firstly, we conducted experiments on eight synthetic datasets, including two three-dimensional datasets and one noise dataset. The first five datasets are common and complex. The detailed information on these datasets is listed in

For instance, as shown in

NDP-Kmeans and GB-DP are algorithms introduced in 2023, whereas SNNDPC and HC-LCCV are relatively new algorithms released within the past five years. Therefore, in this part of the results analysis, we select these four algorithms and present the clustering result graphs for some datasets.

To assess the viability of the HC-SNaN algorithm for high-dimensional data, we conducted comparative experiments with seven other clustering algorithms on eight real-world datasets sourced from the UCI Machine Learning Repository. These real-world datasets vary in scale and dimensions. The clustering results, presented in

HC-SNaN consistently outperforms the other algorithms on the Wine, Haberman, BCW, and Mnist datasets, where AMI, ARI, and FMI all rank first, with some metrics significantly surpassing those of the other algorithms. On the Ecoli dataset, HC-SNaN achieves optimal AMI and FMI, with ARI ranking second. For the Dermatology dataset, HC-SNaN attains the highest ARI and FMI, surpassing the other six algorithms. On the Contraceptive dataset, HC-SNaN exhibits the best FMI performance. On the Page-blocks dataset, HC-SNaN ranks first in AMI and ARI, but its FMI is slightly inferior to that of the DPC algorithm.

Overall, experimental verification confirms that HC-SNaN outperforms the other seven algorithms on most real-world datasets. The results underscore the algorithm’s high clustering performance in discovering various shapes of clustering structures, particularly excelling in handling clustering tasks in dense regions.

In this paper, we introduce a novel neighbor method called shared natural neighbors (SNaN). The SNaN is derived by combining natural neighbors and shared neighbors, and then a graph

However, it is important to note that our algorithm for processing large-scale high-dimensional data may incur substantial time costs, which is an inherent limitation of hierarchical clustering. Therefore, further research is needed to explore the application of this algorithm to massive high-dimensional datasets.

The authors would like to thank the editors and reviewers for their professional guidance, as well as the team members for their unwavering support.

This work was supported by Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD-M202300502, KJQN201800539).

The authors confirm contribution to the paper as follows: study conception and design: Zhongshang Chen and Ji Feng; Manuscript: Zhongshang Chen; Analysis of results: Degang Yang and Fapeng Cai. All authors reviewed the results and approved the final version of the manuscript.

The data that support the findings of this study are available from the corresponding author upon reasonable request.

The authors declare that they have no conflicts of interest to report regarding the present study.