Traditional clustering algorithms often struggle to produce satisfactory results on datasets with uneven density, and they incur substantial computational costs on high-dimensional data because they must calculate similarity matrices. To alleviate these issues, we employ a KD-Tree to partition the dataset and compute the K-nearest-neighbor (KNN) density of each point, thereby avoiding the computation of similarity matrices. Moreover, we apply the rules of voting elections, treating each data point as a voter that casts a vote for the point with the highest density among its KNN. Using the vote counts of each point, we develop a strategy for classifying noise points and potential cluster centers, allowing the algorithm to identify clusters with uneven density and complex shapes. Additionally, we define the concept of “adhesive points” between two clusters to merge adjacent clusters with similar densities, which lets us determine the optimal number of clusters automatically. Experimental results indicate that our algorithm improves both the efficiency and the accuracy of clustering.

Clustering is an approach to identifying inherent groups or clusters in high-dimensional data based on certain similarity metrics [

Traditional clustering analysis techniques can be divided into three main categories: hierarchical, partitioning, and density-based clustering methods [

In addition, the Density Peaks Clustering [

To avoid calculating the similarity matrix and reduce the time spent computing data-point densities, we use a KD-Tree to partition the entire dataset. Inspired by voting elections, we automatically discover high-density cluster centers using a voting rule. Integrating the concept of K-nearest neighbors, we automatically assign data points to the high-density center among their K-nearest neighbors. Finally, we propose the Density Clustering Algorithm based on KD-Tree and Voting Rules (KDVR). This algorithm can identify cluster centers by itself, without manually setting the number of clusters, and delivers excellent clustering results with relatively low time complexity.

The primary contributions of this paper can be summarized as follows:

(1) To address the problems of traditional density clustering algorithms, which require manually setting the truncation distance and spend considerable time computing the similarity matrix, we adopt KD-Tree technology to partition the entire dataset and compute a domain-adaptive density for each point, thus avoiding the calculation of the similarity matrix and the manual setting of the truncation distance, which improves the running efficiency of the algorithm.

(2) Drawing inspiration from the voting rule, we treat each data point as a voter, allowing it to automatically choose the point with the highest density among its K-nearest neighbors. At the same time, we introduce the concept of ‘Adhesive Points’ to merge adjacent, density-similar clusters, thereby solving the problems of traditional density clustering algorithms that require manual selection of cluster centers and perform poorly on unevenly dense and complex-shaped datasets.

(3) We have elaborated on the principles and implementation process of the KDVR algorithm and confirmed the efficacy of the proposed algorithm through experiments. The outcomes of the experiments demonstrate that the algorithm can efficiently handle unevenly dense and complex-shaped datasets.

The rest of this paper is structured as follows:

This section offers an in-depth exploration of the DPC algorithm. We also provide some key definitions, such as the domain-adaptive density, the principle of the KD-Tree, and the specific steps for building a KD-Tree and performing nearest-neighbor searches, to help readers better understand our proposed algorithm.

DPC is a highly efficient clustering algorithm that assumes cluster centers satisfy two characteristics: (1) they are surrounded by data points with comparatively lower density; (2) they are distant from other data points with comparatively higher density.

The original DPC algorithm typically employs truncated and Gaussian kernels to calculate the density of data points, choosing the computational technique according to the characteristics of the dataset. Suppose the dataset is X = {x_1, x_2, …, x_n} and d_ij denotes the Euclidean distance between points x_i and x_j.

The density of data point x_i calculated with the truncated kernel is defined as ρ_i = Σ_{j≠i} χ(d_ij − d_c), where χ(x) = 1 if x < 0 and χ(x) = 0 otherwise.

The density of data point x_i calculated with the Gaussian kernel is defined as ρ_i = Σ_{j≠i} exp(−(d_ij / d_c)^2).

Here, d_c denotes the cutoff distance, which must be specified in advance.

The relative distance for the sample point with the highest density is defined as δ_i = max_j (d_ij).

The relative distance for other sample points is defined as δ_i = min_{j: ρ_j > ρ_i} (d_ij).

In DPC, the sample point with the greatest local density is identified as a cluster center, and its relative distance value is set to the maximum value. The selection of other density peaks as cluster centers requires them to meet two criteria: a high local density ρ_i and a relatively large distance δ_i from any point of higher density.
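As a toy illustration of the two DPC quantities just defined, the following sketch computes ρ (Gaussian kernel) and δ by brute force; the point set and cutoff distance are invented for the example, and this is not the paper's implementation.

```python
import math

def dpc_rho_delta(points, d_c):
    """Compute DPC local density (Gaussian kernel) and relative distance.

    points: list of coordinate tuples; d_c: cutoff distance.
    Returns (rho, delta) lists.
    """
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Gaussian-kernel density: rho_i = sum_{j != i} exp(-(d_ij / d_c)^2)
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # Highest-density point: delta is its maximum distance to any point;
        # otherwise: minimum distance to a denser point.
        delta.append(max(dist[i]) if not higher else min(higher))
    return rho, delta
```

On a small cloud plus one outlier, the point surrounded by neighbors gets the largest ρ, while the outlier's δ stays large, matching the two cluster-center criteria above.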

Density peak clustering has been proven to be an efficient and intuitive algorithm capable of identifying datasets of diverse shapes. However, it still has certain limitations:

(1) In the DPC clustering algorithm, the strategy of selecting density centers is based on a decision graph and is determined manually. The criteria include the local density of data points within the dataset and their relative distance from other data points with higher density. However, when the dataset exhibits a wide range of densities, the influence of local density often outweighs that of relative distance, potentially leading to the misidentification of cluster centers and the loss of sparse clusters. For example, in the T4 dataset shown in

(2) Once the cluster centers have been identified, the assignment strategy for other data points in the DPC clustering algorithm is relatively straightforward. The DPC algorithm assigns these data points to clusters that are denser than them. However, this assignment strategy fails to fully consider the impact of complex data environments. As shown in

To overcome the issue of suboptimal clustering results for data with varying density distributions (VDD), balanced distributions, and multiple-domain density maxima in DPC algorithms, Chen et al. [

In this paper, to overcome the DPC algorithm's problem of losing sparse clusters in VDD data, a domain-adaptive density estimation technique known as KNN density is introduced. This method is used to identify density peaks in regions of different density. By leveraging these density peaks, we can effectively identify cluster centers in both dense and sparse regions, thereby addressing the issue of sparse-cluster loss. Assuming

_{i}, satisfies the following conditions: (1)

Here, KNN_i represents the set of K-nearest neighbors of point x_i.
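A minimal sketch of a KNN density is given below. The paper's exact formula is not reproduced here, so the definition used (K divided by the sum of distances to the K nearest neighbors) is an illustrative assumption; the key property it shares with any domain-adaptive KNN density is that it depends only on a point's local neighborhood, not on the global density distribution.

```python
import math

def knn_density(points, k):
    """KNN density sketch: k divided by the sum of distances to the
    k nearest neighbors (assumed form, for illustration only).
    Brute-force neighbor search for clarity; the paper uses a KD-Tree."""
    densities = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        densities.append(k / sum(dists[:k]))
    return densities
```

Because only the K nearest distances enter the formula, a point inside a sparse cluster can still obtain a locally high density relative to its own neighborhood.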

The KD-Tree [

Algorithm 2 illustrates the nearest neighbor search process of the KD-Tree. It exhibits remarkable efficiency in handling substantial amounts of data.

In many clustering algorithms, a linear search is typically employed for neighbor search, which involves calculating the distance between each data point in the dataset and a reference data point in a sequential fashion. However, when dealing with large-scale datasets, this approach can become time-consuming and resource-intensive.

The algorithm can be categorized into four parts: (1) Constructing a KD-Tree; (2) Voting to select noise points and core points; (3) Identifying potential cluster centers; (4) Adhesive points identification and cluster merging.

KNN has been proven to be an effective density estimation technique. However, for high-dimensional and large-scale datasets, the cost of obtaining the KNN set of a specific data point using a traditional linear search grows rapidly with the size of the data. Therefore, this algorithm uses the KD-Tree nearest-neighbor search method to obtain the KNN set of each data point. The construction of the KD-Tree is outlined in Algorithm 1, and the nearest-neighbor search process is described in Algorithm 2. Compared to the O(n^2) time complexity of linear search, the time complexity of KD-Tree nearest-neighbor search is O(n log n).

Suppose we have a set of two-dimensional data points X = {A, B, C, D, E, F, G, H, I, J}. Firstly, we calculate the variance of each dimension of the data and select the dimension with the largest variance as the split dimension.
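The construction and search steps described above can be sketched in pure Python. This is an illustrative implementation, not the paper's Algorithms 1 and 2: it splits on the largest-variance dimension at the median, and the search prunes a branch when the splitting plane is farther than the current best.

```python
import math

def build_kdtree(points):
    """Build a KD-Tree: split on the dimension with the largest variance,
    at the median point along that dimension."""
    if not points:
        return None
    dims = len(points[0])
    def variance(d):
        vals = [p[d] for p in points]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)
    axis = max(range(dims), key=variance)
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {"point": pts[mid], "axis": axis,
            "left": build_kdtree(pts[:mid]),
            "right": build_kdtree(pts[mid + 1:])}

def nearest(node, target, best=None):
    """Recursive nearest-neighbor search with branch pruning."""
    if node is None:
        return best
    if best is None or math.dist(target, node["point"]) < math.dist(target, best):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # visit the far branch only if the splitting plane is closer than best
    if abs(diff) < math.dist(target, best):
        best = nearest(far, target, best)
    return best
```

The pruning test is what brings the average query cost down to O(log n), versus O(n) for a linear scan.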

_{i}, For any point _{i}, if it satisfies the condition: _{i}.

As shown in

For the given point

If

As shown in
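The voting step can be sketched as follows: every point casts one vote for the densest point among its K nearest neighbors, so local density maxima accumulate votes. The thresholds the paper uses to separate noise points from potential centers are not reproduced here, so the function only returns the raw vote counts (and uses brute-force neighbor search rather than the KD-Tree, for brevity).

```python
import math

def vote(points, densities, k):
    """Voting sketch: each point votes for the point with the maximum
    KNN density among its k nearest neighbors (itself included).
    Points receiving many votes are local density maxima, i.e.
    candidate cluster centers."""
    n = len(points)
    votes = [0] * n
    for i in range(n):
        # indices of i's k nearest neighbors, plus i itself (distance 0)
        nbrs = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))[:k + 1]
        winner = max(nbrs, key=lambda j: densities[j])
        votes[winner] += 1
    return votes
```

A point with zero votes that also has low density is a natural noise candidate, while a point collecting votes from its whole neighborhood behaves like an elected local center.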

Let C = {c_1, c_2, …, c_m} denote the set of potential centers, where m is the number of potential centers.

To facilitate the merging of clusters with closely interconnected boundaries and similar densities, we introduce two novel concepts: the degree of cluster affiliation, denoted as bs, and the notion of adhesive points.

The degree of affiliation represents the extent to which a point belongs to its respective cluster. For any given point

Here,

Where

Just as shown in

Let C = {c_1, c_2, …, c_m} denote the set of potential centers. Let

Detailed steps of the density-based clustering algorithm based on KD-Tree and the voting rule are as follows:

Using the Cth3 dataset as a demonstration, we illustrate the clustering process of this algorithm. The Cth3 dataset consists of 1146 sample points and contains 4 clusters in the ground-truth labeling. The procedure of the algorithm is outlined in

As shown in the figures:

The KDVR algorithm primarily consists of the construction and K-nearest-neighbor search of the KD-Tree, the identification of noise points and potential clusters, and the recognition of adhesive points and merging of potential clusters. If the dataset contains

In the noise-point identification process, each point casts a vote for the point with the maximum KNN density among its K-nearest neighbors, resulting in a cost of O(n^2) in the worst case. Therefore, the time complexity of the KDVR algorithm can be summarized in

It can be seen that the time complexity of this algorithm is largely influenced by the identification of adhesive points and the construction of the KD-Tree, which is O(n^2), where

To confirm the efficiency of the KDVR algorithm, experiments were performed on 6 synthetic datasets [

The KDVR, DPC-KNN, DPC, DBSCAN, DPC-CM, and NNN-DPC algorithms were implemented in Python on the PyCharm 2019 platform. In the experiments, all datasets were preprocessed with min-max normalization. For the DPC-KNN algorithm, the number of clusters must be determined beforehand, and the parameter “
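The min-max normalization applied to all datasets can be sketched in a few lines (column-wise, pure Python):

```python
def min_max_normalize(data):
    """Column-wise min-max normalization to [0, 1].
    data: list of rows (equal-length lists of floats).
    Constant columns are mapped to 0 to avoid division by zero."""
    dims = len(data[0])
    lo = [min(row[d] for row in data) for d in range(dims)]
    hi = [max(row[d] for row in data) for d in range(dims)]
    return [[(row[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0
             for d in range(dims)] for row in data]
```

Normalizing each feature to the same range keeps distance computations from being dominated by features with large raw scales.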

ARI, or Adjusted Rand Index, measures the similarity between two clusterings while accounting for permutations of the cluster labels. This metric more accurately reflects the similarity between clustering results, since it corrects for the agreement expected under random labeling, which increases the credibility of comparisons. The formula for calculating ARI is as follows:

ARI = (RI − E_{RI}) / (max(RI) − E_{RI}), where RI denotes the Rand Index and E_{RI} represents its expected value under random label assignment.
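As a concrete check of the formula, ARI can be computed from the contingency table in pure Python (the standard Hubert–Arabie form; an illustrative sketch, not the paper's implementation):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table (Hubert-Arabie formulation)."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    a = Counter(labels_true)   # cluster sizes in the true partition
    b = Counter(labels_pred)   # cluster sizes in the predicted partition
    index = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)   # E[RI] term
    max_index = (sum_a + sum_b) / 2         # max(RI) term
    return (index - expected) / (max_index - expected)
```

Note that ARI is invariant to relabeling: a prediction identical to the ground truth up to a permutation of labels scores exactly 1.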

NMI is the normalized version of mutual information (MI), which measures the similarity between two clustering results. The formula for calculating NMI is as follows:

Here, n_{ij} is the number of samples shared by cluster C_i and cluster C′_j, n_i is the number of samples within cluster C_i, and n_j is the number of samples within cluster C′_j.
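NMI can likewise be computed directly from the counts just defined. The sketch below normalizes MI by the geometric mean of the two entropies, sqrt(H(U)·H(V)); the paper's exact normalization (sqrt versus arithmetic mean) is an assumption here.

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_true, labels_pred):
    """NMI sketch: MI(U, V) / sqrt(H(U) * H(V)), computed from counts."""
    n = len(labels_true)
    a, b = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    # MI = sum_ij (n_ij / n) * log(n * n_ij / (n_i * n_j))
    mi = sum((c / n) * log(n * c / (a[i] * b[j]))
             for (i, j), c in joint.items())
    h_a = -sum((c / n) * log(c / n) for c in a.values())
    h_b = -sum((c / n) * log(c / n) for c in b.values())
    return mi / sqrt(h_a * h_b) if h_a and h_b else 0.0
```

Like ARI, NMI is invariant to label permutations, and it drops to 0 when the two partitions are independent.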

ACC stands for clustering accuracy and is utilized to gauge the similarity between the clustering labels generated and the true labels provided by the data. The calculation formula is as follows:

where c_i and y_i represent the clustering label obtained for data point x_i and its true label, respectively, n is the total number of samples, and map(·) is the optimal mapping between cluster labels and true labels.
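A sketch of ACC with the optimal label mapping follows. For simplicity it brute-forces all one-to-one relabelings, which is fine for a handful of clusters; practical implementations use the Hungarian algorithm instead.

```python
from itertools import permutations

def clustering_accuracy(labels_true, labels_pred):
    """ACC sketch: best agreement over all one-to-one mappings from
    predicted cluster labels to true labels (brute force).
    Assumes no more predicted clusters than true clusters."""
    true_ids = sorted(set(labels_true))
    pred_ids = sorted(set(labels_pred))
    n = len(labels_true)
    best = 0
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(1 for t, p in zip(labels_true, labels_pred) if mapping[p] == t)
        best = max(best, hits)
    return best / n
```

The mapping step matters: without it, a clustering identical to the ground truth but with swapped label names would be scored as 0% accurate.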

We compare the KDVR, DPC-KNN, DPC, DBSCAN, DPC-CM, and NNN-DPC algorithms on 6 synthetic datasets with various characteristics. The selected synthetic datasets are presented in

Datasets | Number of samples | Features | Clusters |
---|---|---|---|
Cth3 | 1146 | 2 | 4 |
Aggregation | 788 | 2 | 7 |
D31 | 3100 | 2 | 31 |
D6 | 1400 | 2 | 4 |
Ls3 | 1735 | 2 | 6 |
T7 | 8000 | 2 | 9 |

Datasets | Metric | DPC | DBSCAN | DPC-KNN | NNN-DPC | DPC-CM | KDVR |
---|---|---|---|---|---|---|---|
Cth3 | ACC | 0.7731 | 0.8961 | 0.8551 | 0.8032 | | |
 | ARI | 0.5072 | 0.6113 | 0.7863 | 0.5611 | | |
 | NMI | 0.6947 | 0.7868 | 0.9062 | 0.6842 | | |
Aggregation | ACC | 0.9962 | 0.9835 | 0.9848 | 0.9949 | 0.9962 | |
 | ARI | 0.9935 | 0.9779 | 0.9730 | 0.9951 | 0.9920 | |
 | NMI | 0.9896 | 0.9681 | 0.9669 | 0.9898 | 0.9884 | |
D31 | ACC | 0.9674 | 0.8323 | 0.8171 | 0.9664 | 0.9732 | |
 | ARI | 0.9345 | 0.5636 | 0.5942 | 0.9431 | 0.9630 | |
 | NMI | 0.9571 | 0.8411 | 0.8626 | 0.9588 | 0.9459 | |
D6 | ACC | 0.6150 | 0.9607 | 0.6171 | 0.9779 | 0.7455 | |
 | ARI | 0.4129 | 0.9132 | 0.4294 | 0.9311 | 0.5428 | |
 | NMI | 0.4653 | 0.8873 | 0.6744 | 0.9459 | 0.6305 | |
Ls3 | ACC | 0.6334 | 0.9389 | 0.9440 | 0.6906 | | |
 | ARI | 0.2335 | 0.8226 | 0.8865 | 0.4063 | | |
 | NMI | 0.4653 | 0.8898 | 0.9023 | 0.5454 | | |
T7 | ACC | 0.7726 | 0.8419 | 0.8149 | 0.9676 | 0.8235 | |
 | ARI | 0.5667 | 0.6881 | 0.7006 | 0.7123 | 0.9663 | |
 | NMI | 0.7459 | 0.7382 | 0.7934 | 0.9422 | 0.7703 | |

For complex-shaped datasets with multi-scale and cross-surrounding characteristics, the KDVR algorithm fully considers the impact of local density peaks on the clustering result: by improving the cluster-center strategy, it divides a complex-shaped dataset into multiple sub-clusters centered on local density maxima, and the adhesive points then merge these sub-clusters to recover the complex-shaped clusters. Compared to other DPC-based improvements, the KDVR algorithm is more adept at handling such datasets, as the experimental results on the Cth3, D6, Ls3, and T7 datasets confirm. Although the clustering result of the KDVR algorithm is not optimal on datasets with uneven density, it is still close to the optimal level; when the density variance is too large, the cluster-center decision is disturbed, preventing the best clustering result. However, the other algorithms cannot achieve good clustering results on both types of datasets simultaneously.

To further contrast the clustering performance of the KDVR, DPC-KNN, DPC, DBSCAN, DPC-CM, and NNN-DPC algorithms on real-world datasets, ten diverse UCI datasets were selected for experimentation. These datasets vary in terms of their data size and dimensions, allowing for a better demonstration of the KDVR algorithm’s efficacy in handling high-dimensional data.

Datasets | Number of samples | Features | Clusters |
---|---|---|---|
Zoo | 101 | 16 | 7 |
Ecoli | 336 | 8 | 8 |
Seed | 210 | 7 | 3 |
Vote | 435 | 16 | 2 |
Vowel | 871 | 3 | 6 |
Wine | 178 | 13 | 3 |
Cancer | 683 | 9 | 2 |
WBC | 683 | 9 | 2 |
Pendigits | 3498 | 16 | 10 |
WDBC | 569 | 30 | 2 |

Algorithm | Zoo ACC | Zoo ARI | Zoo NMI | Vote ACC | Vote ARI | Vote NMI |
---|---|---|---|---|---|---|
DPC | 0.6337 | 0.4972 | 0.7224 | 0.8757 | 0.5921 | 0.515 |
DBSCAN | 0.8812 | 0.9326 | 0.8968 | 0.8 | 0.465 | 0.387 |
DPC-KNN | 0.6733 | 0.6043 | 0.7815 | 0.8989 | 0.6354 | 0.5534 |
NNN-DPC | 0.7632 | 0.6798 | 0.8136 | 0.6442 | 0.5531 | |
DPC-CM | 0.7245 | 0.6379 | 0.7633 | 0.8459 | | |
KDVR | 0.8874 | 0.6018 | 0.5202 | | | |

Algorithm | ACC | ARI | NMI | ACC | ARI | NMI |
---|---|---|---|---|---|---|
DPC | 0.6726 | 0.4913 | 0.5151 | 0.7451 | 0.5092 | 0.583 |
DBSCAN | 0.8661 | 0.4963 | 0.4947 | 0.4007 | 0.1558 | 0.1011 |
DPC-KNN | 0.7232 | 0.5463 | 0.5524 | 0.7218 | 0.2951 | 0.5672 |
NNN-DPC | 0.8551 | 0.5802 | 0.7463 | 0.5123 | 0.5703 | |
DPC-CM | 0.7702 | 0.5619 | 0.5641 | 0.7214 | 0.5232 | 0.5779 |
KDVR | 0.5666 | | | | | |

Algorithm | ACC | ARI | NMI | ACC | ARI | NMI |
---|---|---|---|---|---|---|
DPC | 0.9 | 0.7341 | 0.7238 | 0.882 | 0.6723 | 0.7104 |
DBSCAN | 0.8571 | 0.4097 | 0.4839 | 0.8146 | 0.5292 | 0.5905 |
DPC-KNN | 0.8905 | 0.7127 | 0.7047 | 0.8876 | 0.6885 | 0.7181 |
NNN-DPC | 0.8806 | 0.7233 | 0.7226 | 0.9119 | 0.7142 | 0.7444 |
DPC-CM | 0.8903 | 0.6969 | 0.7223 | | | |
KDVR | 0.9048 | 0.7432 | 0.7142 | | | |

Algorithm | ACC | ARI | NMI | ACC | ARI | NMI |
---|---|---|---|---|---|---|
DPC | 0.5368 | 0.4934 | 0.4404 | 0.7236 | 0.4934 | 0.4404 |
DBSCAN | 0.8816 | 0.8362 | 0.7456 | 0.8829 | 0.7456 | |
DPC-KNN | 0.6036 | 0.5054 | 0.4562 | 0.7961 | 0.6774 | 0.7103 |
NNN-DPC | 0.8613 | 0.7901 | 0.7112 | 0.8662 | 0.7517 | 0.7371 |
DPC-CM | 0.7274 | 0.6624 | 0.5767 | 0.8069 | 0.6987 | 0.6641 |
KDVR | 0.7469 | | | | | |

Algorithm | ACC | ARI | NMI | ACC | ARI | NMI |
---|---|---|---|---|---|---|
DPC | 0.8742 | 0.7776 | 0.763 | 0.5231 | 0.4822 | 0.4374 |
DBSCAN | 0.8413 | 0.7384 | 0.7922 | 0.4687 | 0.356 | 0.3662 |
DPC-KNN | 0.7761 | 0.7161 | 0.6223 | 0.4824 | 0.4519 | 0.3896 |
NNN-DPC | 0.8229 | 0.7653 | 0.7125 | 0.5294 | 0.4844 | 0.4635 |
DPC-CM | 0.8895 | 0.5409 | 0.4955 | 0.4714 | | |
KDVR | 0.9053 | 0.8211 | | | | |

The number of nearest neighbors,

This algorithm uses a KD-Tree to partition the entire dataset and replaces the traditional density calculation with the K-nearest-neighbor density, thus avoiding the construction of similarity matrices and significantly reducing the computational cost. At the same time, the use of KNN density helps to identify clusters with uneven density. In addition, the algorithm introduces the degree of affiliation of sample points to each cluster and, on this basis, defines adhesive points between similar clusters. Through adhesive points, regions with similar densities are connected and potential clusters with high similarity are merged, thereby effectively identifying clusters with complex shapes and automatically determining the optimal number of clusters. Experiments show that the algorithm is not only efficient but also achieves favorable clustering results.

We observed that the key to obtaining ideal clustering results with density-based algorithms lies in correctly selecting the density cluster centers. As long as a region contains enough data points, a density-based algorithm can recognize it as a cluster, which implies that a cluster center must be the densest point within its KNN; this resembles the principle of a voting election. This observation inspired us to use a KD-Tree to partition the entire dataset so that each data point can quickly find its KNN set, to calculate the KNN density of each point, and then to have each point vote for the densest point among its KNN. In this way, we can find the densest point in each local region without calculating a similarity matrix, thereby improving the efficiency of the algorithm. Moreover, since the KNN density considers only the distances to each point's K nearest neighbors, cluster centers are located within their local environment rather than being affected by the density distribution of the entire dataset, enabling the algorithm to identify sparse clusters in datasets with uneven density. In addition, through experiments we found that for points on the boundary between dense, adjacent clusters, the K-nearest neighbors contain many points belonging to the other cluster, indicating that the two clusters have similar densities and adjacent boundaries. We therefore merge such pairs of clusters, ultimately enabling the algorithm to identify clusters with complex shapes.

Although this algorithm improves efficiency to some extent, this mainly depends on the selection of an appropriate

The authors of this article would like to thank all the members involved in this experiment, especially Mr. Hu, Mr. Ma, Mr. Liu, and Professor Wang, who contributed valuable insights and provided useful references and suggestions for the subsequent review of the experiment.

This work was supported by National Natural Science Foundation of China Nos. 61962054 and 62372353.

The authors confirm contribution to the paper as follows: study conception and design: Hui Du, Zhiyuan Hu; data collection: Zhiyuan Hu; analysis and interpretation of results: Hui Du, Zhiyuan Hu, Depeng Lu; draft manuscript preparation: Zhiyuan Hu, Depeng Lu. All authors reviewed the results and approved the final version of the manuscript.

All the artificial datasets used in this article are sourced from:

The authors declare that they have no conflicts of interest to report regarding the present study.