The dimensionality of data is increasing rapidly, which creates challenges for most current mining and learning algorithms, such as large memory requirements and high computational costs. The literature includes much research on feature selection for supervised learning. However, feature selection for unsupervised learning has only recently been studied. Finding the subset of features in unsupervised learning that enhances performance is challenging since the clusters are indeterminate. This work proposes a hybrid technique for unsupervised feature selection called GAkMEANS, which combines the genetic algorithm (GA) approach with the classical k-Means algorithm. In the proposed algorithm, a new fitness function is designed in addition to new smart crossover and mutation operators. The effectiveness of this algorithm is demonstrated on various datasets. Furthermore, the performance of GAkMEANS has been compared with other genetic algorithms, such as the genetic algorithm using the Sammon Error Function and the genetic algorithm using the Sum of Squared Error Function. Additionally, the performance of GAkMEANS is compared with state-of-the-art statistical unsupervised feature selection techniques. Experimental results show that GAkMEANS consistently selects subsets of features that result in better classification accuracy compared to the others. In particular, GAkMEANS is able to significantly reduce the size of the subset of selected features by an average of 86.35% (72%–96.14%), which leads to an increase in accuracy by an average of 3.78% (1.05%–6.32%) compared to using all features. When compared with the genetic algorithm using the Sammon Error Function, GAkMEANS is able to reduce the size of the subset of selected features by 41.29% on average, improve the accuracy by 5.37%, and reduce the time by 70.71%.
When compared with the genetic algorithm using the Sum of Squared Error Function, GAkMEANS on average is able to reduce the size of the subset of selected features by 15.91% and improve the accuracy by 9.81%, but the time is increased by a factor of 3. When compared with the machine-learning-based methods, we observed that GAkMEANS is able to increase the accuracy by 13.67% on average, with an 88.76% average increase in time.
Feature selection aims to identify the minimum-size subset of features that is necessary and sufficient to stay close to the original class distribution. Irrelevant or redundant features, whose presence does not improve the proximity to the original class distribution, can be dropped [
Most feature selection techniques are dedicated to supervised learning. Unsupervised feature selection is known to be a more difficult problem due to the unavailability of class labels that could facilitate efficient exploration of the search space of the problem. Few studies have tackled this problem. The literature offers four types of machine learning approaches to feature selection, namely wrapper, filter, embedded, and hybrid [
The mathematical formulation of the unsupervised feature selection problem leads to a hard combinatorial optimization problem (see
The Genetic Algorithm (GA) is a powerful population-based search metaheuristic that has demonstrated its effectiveness and superiority in solving famous hard combinatorial optimization problems such as the Quadratic Assignment Problem (QAP) [
In this paper, we propose a hybrid technique, called GAkMEANS, which combines the genetic algorithm approach with the classical k-Means algorithm. In order to efficiently explore the huge search space of the UFS problem and to guide the search toward potential solutions, we modeled the UFS as a multi-criteria optimization problem. For this purpose, we designed a new fitness function based on the k-Means algorithm. Our fitness function stands out from other existing fitness functions in the literature because it is designed around the characteristics of the optimization problem modeling UFS itself, rather than adapting existing fitness functions from the literature. The main contributions of this research are as follows:
We proposed a hybrid technique for unsupervised feature selection called GAkMEANS, which combines the genetic algorithm (GA) approach with the classical k-Means algorithm to find a low-dimensional subset of features. The proposed technique is experimentally evaluated using a set of datasets that vary in dimensionality.
The initial population plays a major role in the effectiveness of any genetic algorithm. Thus, we investigated how to generate an initial population that enhances the performance of genetic algorithms solving the UFS problem. In this paper, we carefully generate a set of initial individuals to improve the search and converge quickly to a satisfactory near-optimal solution.
Designing an efficient fitness function for a genetic algorithm is a substantial and challenging task. In the proposed algorithm, a new fitness function relying on the kRMeans algorithm is designed to assess the quality of a subset of features, and especially to direct the exploration of the search space of the problem toward potential (near-optimal) solutions.
We designed a new smart crossover operator to ensure the exchange of the best-selected feature subsets and create a better subset of features.
We designed a new suitable mutation operator based on the feature similarity measure to diversify the search and maintain fitter individuals.
Our approach, GAkMEANS, can be categorized as a hybrid filter-wrapper approach, as it assesses a subset of features based on their similarities independently of any learning algorithm (as in the filter approach). However, it stands out from traditional filter approaches by considering and assessing several candidate solutions instead of only the first one selected based on the similarity coefficients (as in the wrapper approach).
The designed hybrid technique is implemented and tested using five different datasets. The performance of GAkMEANS has been compared with the results of other genetic algorithms, namely the genetic algorithm using the Sammon Error Function and the genetic algorithm using the Sum of Squared Error Function. Moreover, the performance of GAkMEANS is compared against state-of-the-art statistical approaches for unsupervised feature selection.
Experimental results show that GAkMEANS consistently selects the subsets of features that result in better classification accuracy compared to others.
Real-world data suffers from the presence of many irrelevant, redundant, and noisy features. Removing these features by feature selection reduces the computational cost in terms of both time and space [
A similarity measure is a way of measuring how data samples are related or close to each other. Data samples are related if their corresponding features are related. Without loss of generality, we can assume that all features are real-valued vectors. Some of the most popular similarity measures are described below:
Euclidean distance is the basis of many measures of similarity and dissimilarity [
Euclidean distance is only appropriate for data measured on the same scale. The correlation coefficient is (inversely) related to the Euclidean distance between standardized versions of the data.
Another commonly used distance measure is the Manhattan distance. The Manhattan distance between features X and Y is the sum of the absolute differences between the corresponding elements of the two features, see
The following is a definition of the correlation between the vectors X and Y [
where x̄ and ȳ are the means of X and Y, respectively.
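For concreteness, the three measures above can be sketched in Python as follows (an illustration, not the paper's implementation; features are treated as NumPy vectors, and the function names are ours):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def manhattan(x, y):
    """Manhattan distance: sum of absolute element-wise differences."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y)))

def pearson(x, y):
    """Pearson correlation coefficient between two feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))
```

Note that `pearson` is scale-invariant, whereas `euclidean` and `manhattan` are only appropriate for features measured on the same scale, as discussed above.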
In this paper, the designed fitness function is based on the kRMeans algorithm, a relaxed version of k-Means that can rely on any feature similarity measure.
The genetic algorithm (GA) is a well-known population-based metaheuristic inspired by biological evolution processes [
In this section, some of the state-of-the-art unsupervised feature selection methods [
Similarity-based methods assess feature importance by their ability to preserve data similarity. These include the Laplacian Score [
SPEC observes that, in most cases, both supervised and unsupervised feature selection methods select the features that separate the data points into subsets according to their similarity. It proposes analyzing the spectrum of the graph induced by the pairwise similarity of the data points as a generic framework for both settings. SPEC shows that many algorithms, including the Laplacian Score, are special cases of this generic framework.
Sparse-learning-based methods select features by removing the least useful ones. They use a sparse regularizer to weigh the importance of each feature and force the coefficients of non-useful features to be very small or exactly zero, so that those features are removed. Examples of sparse-learning-based methods are Multi-Cluster Feature Selection (MCFS) [
In the experimental evaluation section, we compare the results obtained by the proposed GAkMEANS method with those of these statistical methods.
There are some recent works that proposed unsupervised feature selection algorithms. For example, Miao et al. [
Methods for solving the unsupervised feature selection problem follow either a metaheuristic artificial intelligence (AI) strategy or a machine learning (ML) strategy, such as the methods mentioned above. Metaheuristics in artificial intelligence are intelligent techniques intended to explore the search space of the UFS optimization problem and find a near-optimal solution in a reasonable time, since only some promising regions of the search space are explored to determine a near-optimal solution. For this purpose, several metaheuristics have been designed to deal with the UFS optimization problem; they are discussed in the following sections.
Metaheuristics are generic algorithms that can be used to solve a wide range of optimization problems. Recently, a few metaheuristics have been designed to deal with the UFS optimization problem. For example, evolutionary approaches such as the binary bat algorithm [
Ant colony optimization (ACO) is one of the most well-known swarm-based algorithms. Dadaneh et al. [
Ramasamy et al. [
Only a few metaheuristic algorithms have been proposed to deal with the UFS optimization problem. Metaheuristic algorithms, including the genetic algorithm, have proven their efficiency in solving optimization problems through a balance between exploration and random search [
To the best of our knowledge, many unsupervised feature selection methods are reported in the literature, but few of them address unsupervised feature selection using genetic algorithms.
Shamsinejadbabki et al. [
Abualigah et al. [
Agrawal et al. [
Saxena et al. [
As mentioned in the introduction, GA algorithms have proven their effectiveness and superiority in solving well-known hard combinatorial optimization problems such as the Traveling Salesman Problem (TSP) and the Quadratic Assignment Problem (QAP). On the other hand, to the best of our knowledge, the few UFS approaches in the literature that utilize genetic algorithms remain inadequate and unsatisfactory, which further motivates research in this field. Furthermore, most genetic algorithms for solving UFS problems rely on basic classical operators and generic fitness functions that do not consider the structural aspects and characteristics of the UFS optimization problem. Thus, there is a need to design a more appropriate genetic algorithm for UFS by developing a more adequate fitness function and biologically-inspired operators. Modeling unsupervised feature selection as an optimization problem enables us to design a new fitness function proper to the UFS problem that naturally guides the search toward feasible and promising regions of solutions. Our fitness function is distinct from other fitness functions in the literature because it is designed around the characteristics of the UFS problem, rather than adapting existing fitness functions. Similarly, this modeling allowed us to introduce proper smart crossover and mutation operators to explore the huge search space of the UFS problem, rather than adapting classical operators such as the famous one-point crossover. Finally, these new algorithmic contributions to UFS deserve to be investigated in order to determine their impact on several UFS datasets from the literature.
We model unsupervised feature selection as a Multi-Criteria Optimization problem. More formally, given a similarity measure between the features of data samples, let C_{1}, …, C_{k} be a set of k clusters, where each cluster groups similar features according to the similarity measure. Let P_{ij} be the similarity measure of the two features i and j (i < j). The best subset of features should fulfill the following three objective functions (given in points 1, 2, 3) subject to the two constraints (given in point 4):
1. Minimize the number k of selected features, where LB and UB are given lower and upper bounds on k.
The huge search space of the optimization problem is explored using the designed hybrid technique detailed in the following section. The three objective functions are handled explicitly via the fitness function and the crossover and mutation operators designed for this purpose.
In this section, we describe the designed GA concepts for the UFS problem. We start by encoding a possible solution of UFS, then we introduce the simple variant of the k-Means algorithm that will be used in defining the fitness function of an individual and generating the initial population of the hybrid technique GAkMEANS. We describe how to calculate an individual's fitness value, as well as the newly designed crossover and mutation operators for UFS. Some definitions are introduced below to illustrate the proposed algorithm of the hybrid technique.
An individual I is a set of integers, where each number represents a feature; the cardinality of the set I is the number of features in the individual I.
The first step in any GA is to create the first generation of the population. The population is a set of individuals; each individual consists of a subset of k features (LB <= k <= UB). Basic genetic algorithms generate the initial population randomly.
In this paper, we carefully generate a set of initial individuals in order to improve the search and to converge quickly to a satisfactory near-optimal solution.
I = {} 
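The baseline random initialization described above can be sketched as follows. This is an illustration only: the paper's careful seeding procedure is not reproduced here, and the function and parameter names are ours.

```python
import random

def random_individual(n_features, lb, ub, rng=random):
    """One individual: a set of k feature indices with lb <= k <= ub."""
    k = rng.randint(lb, ub)                        # inclusive bounds
    return set(rng.sample(range(n_features), k))   # distinct features

def initial_population(pop_size, n_features, lb, ub, rng=random):
    """A population of pop_size randomly generated individuals."""
    return [random_individual(n_features, lb, ub, rng)
            for _ in range(pop_size)]
```

For example, `initial_population(50, 256, 10, 100)` would produce the 50 starting individuals for the USPS dataset under the {LB, UB} bounds listed later in the experiments.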
Given a subset of k features, kRMeans aims to partition the features of F into k clusters such that intra-cluster similarity is maximized. kRMeans is a Relaxed variant of the k-Means algorithm, consisting of a single k-Means iteration. Let I be a subset of k features of F; we create a cluster for each feature in I. kRMeans assigns every feature i of F not in I to the cluster whose center feature is the nearest feature to i. kRMeans relies on the concept of eccentricity to compute the center of a cluster as seen in
Create a cluster for each feature in I 
The kRMeans algorithm can use any feature similarity measure (feature correlation: Pearson, Euclidean distance, symmetric uncertainty, etc.).
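A minimal sketch of the kRMeans assignment step is shown below. It assumes `similarity` returns larger values for more similar features; the eccentricity-based center computation from the referenced algorithm is omitted, and the seeds are simply the features of I.

```python
import numpy as np

def k_r_means(X, I, similarity):
    """Single assignment pass of kRMeans: each feature in I seeds one
    cluster, and every remaining column of X joins the cluster of the
    seed it is most similar to. X has shape (n_samples, n_features)."""
    clusters = {i: [i] for i in I}
    for f in range(X.shape[1]):
        if f in clusters:
            continue
        best = max(I, key=lambda i: similarity(X[:, f], X[:, i]))
        clusters[best].append(f)
    return clusters
```

Any of the similarity measures from the previous section can be plugged in; a distance can be used by negating it so that larger values mean more similar.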
The fitness function assesses the quality of an individual, i.e., a subset of k features (k <= n), where n is the total number of features. For UFS, we define a new fitness function relying on the kRMeans algorithm. The fitness value of an individual is the ratio of intra-cluster similarities to inter-cluster similarities. More formally, the fitness function of an individual I is computed as follows:
This fitness function is defined from the objective functions 2 and 3 of the optimization problem described in
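The exact formula appears in the referenced equation; a plausible reading of the intra/inter-cluster similarity ratio described above can be sketched as follows. This is our interpretation, not the paper's exact definition.

```python
import itertools
import numpy as np

def fitness(X, clusters, similarity):
    """Ratio of mean intra-cluster similarity to mean inter-cluster
    similarity; clusters maps a center feature to its member features."""
    groups = list(clusters.values())
    intra = [similarity(X[:, a], X[:, b])
             for g in groups for a, b in itertools.combinations(g, 2)]
    inter = [similarity(X[:, a], X[:, b])
             for g1, g2 in itertools.combinations(groups, 2)
             for a in g1 for b in g2]
    if not intra or not inter:          # degenerate clustering
        return 0.0
    return float(np.mean(intra) / np.mean(inter))
```

Under this reading, tight, well-separated clusters of features score high, which matches objectives 2 and 3 of the optimization model.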
The roulette wheel selection is used at this step to select the potentially useful solutions for recombination based on the fitness values of the individuals.
The crossover operator explores the search space of UFS solutions by building a new offspring from two parent individuals. UFSX is a new smart crossover operator that we designed to generate a new offspring that moves closer to the optimal solution. Let I1 and I2 be two individuals. The UFSX crossover builds a new offspring in O(k1 k2) time, as shown in
k1 = Size of first individual 
The new crossover UFSX creates offspring with a smaller number of features, selecting from the candidate set those features that are least similar to one another. Thus, UFSX creates offspring that are closer to the optimal solution.
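Assuming that dissimilar features are chosen greedily from the parents' combined pool, UFSX-style behavior can be sketched as follows. The actual operator is defined in the referenced algorithm; the names and the greedy strategy here are our assumptions.

```python
def ufsx_crossover(X, I1, I2, similarity):
    """UFSX-style offspring: greedily keep mutually dissimilar features
    from the parents' combined pool, no larger than the smaller parent."""
    pool = sorted(set(I1) | set(I2))
    target = min(len(I1), len(I2))
    child = [pool.pop(0)]
    while len(child) < target and pool:
        # Pick the pooled feature least similar to the current offspring.
        f = min(pool, key=lambda g: max(similarity(X[:, g], X[:, c])
                                        for c in child))
        pool.remove(f)
        child.append(f)
    return set(child)
```

The nested similarity evaluations over the two parents account for the O(k1 k2) complexity noted above.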
The mutation operator aims to diversify the search in order to explore different promising regions of the search space of the problem. We designed a new mutation operator, UFSM, that exchanges a randomly selected feature i of the individual I with the least similar feature in F \ I. The time complexity of the UFSM operator is O(k n), where n is the number of features in the original set F and k is the number of features in the mutant individual, as shown in
Algorithm UFSM Operator 

Select randomly a feature i from I 
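The UFSM steps above can be sketched as follows (an illustration under our assumptions; `similarity` is assumed to return larger values for more similar features):

```python
import random

def ufsm_mutation(X, I, similarity, rng=random):
    """UFSM sketch: swap a randomly chosen feature of I for the feature
    outside I that is least similar to it; the size of I is preserved."""
    I = set(I)
    i = rng.choice(sorted(I))                      # random feature of I
    outside = [f for f in range(X.shape[1]) if f not in I]
    if not outside:                                # nothing to swap in
        return I
    j = min(outside, key=lambda f: similarity(X[:, f], X[:, i]))
    return (I - {i}) | {j}
```

Scanning all n features outside I for each mutation gives the O(k n) cost stated above.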
The final step is to replace the old population with the new population based on the selection strategy. The GAkMEANS algorithm terminates, announcing the best solution found, when the predetermined number of generations is reached.
We implemented the GAkMEANS algorithm for the UFS problem using Spyder (Python 3.8). Next, we describe the datasets used, the experiment settings, and the performance measures.
Experiments have been conducted on 5 datasets of various types. The datasets are handwritten image dataset USPS [
Dataset     #instances  #features  #classes  {LB, UB}
USPS        9298        256        10        {10, 100}
Isolet      1560        617        26        {10, 250}
COIL20      1440        1024       20        {10, 250}
Lung        203         3312       5         {30, 400}
Orlraws10P  100         10304      10        {30, 1000}
Several user-specified parameters were tuned through comprehensive experiments, by varying the experimental parameters and carrying out multiple runs of the algorithm with different values for the population size, probability of crossover, probability of mutation, and number of iterations. The population size is the number of individuals in a population. The probability of crossover is the probability that crossover occurs in each iteration: for each individual, a randomly generated floating-point value is compared to the crossover probability, and if it is less than the probability, crossover is performed; otherwise, the offspring are identical to the parents. The probability of mutation is the probability of a mutation occurring for an individual: for each individual, a randomly generated floating-point value is compared to the mutation probability, and if it is less than the probability, mutation is performed. The number of iterations, also called the number of generations, is the number of cycles before termination.
Parameters                Value
Population size           50
Probability of crossover  0.5
Probability of mutation   0.02
Number of iterations      50
Results are evaluated in terms of the number of selected features, accuracy, and time. Classification accuracy is measured as the percentage of samples that are correctly classified. The accuracy of each method was measured using a naïve Bayes classifier with 10-fold cross-validation. To evaluate the goodness of the final set of features, the accuracy on the original data with the full set of features was compared with the accuracy on the data containing only the selected features.
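This evaluation protocol can be reproduced with scikit-learn; `GaussianNB` is our assumption for the naïve Bayes variant, since the variant is not specified here, and the helper name is ours.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def accuracy_of_subset(X, y, selected):
    """Mean 10-fold cross-validated naive Bayes accuracy using only
    the selected feature columns of X."""
    cols = sorted(selected)
    scores = cross_val_score(GaussianNB(), X[:, cols], y, cv=10)
    return float(np.mean(scores))
```

Comparing `accuracy_of_subset(X, y, range(X.shape[1]))` against `accuracy_of_subset(X, y, selected)` then mirrors the full-feature versus selected-feature comparison described above.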
This section presents the results of the experiments on GAkMEANS and analyses its performance in terms of classification accuracy, run time, and the number of selected features. In addition, the performance of GAkMEANS has been compared with the results of other genetic algorithms. Furthermore, the performance of GAkMEANS is compared with state-of-the-art unsupervised feature selection techniques.
The performance of GAkMEANS is shown in
            Results without FS  Results with FS
Dataset     F      ACC          Fs   Reduction #features  ACC     Improvement  STDDV (ACC)  Best solution (fitness)
USPS        256    0.7857       70   72.66%               0.8107  3.18%        0.023        0.2122
Isolet      617    0.7942       150  75.69%               0.8306  4.58%        0.024        0.298
COIL20      1024   0.8229       70   93.16%               0.8539  3.77%        0.017        0.2724
Lung        3312   0.8571       196  94.08%               0.9113  6.32%        0.035        0.068648
Orlraws10P  10304  0.95         398  96.14%               0.96    1.05%        0.017        0.490431
Based on the results, we observed that GAkMEANS is able to select a small number of features. After applying the GAkMEANS algorithm to the given datasets, so that the dimensionality became smaller, we found that the subset of features selected by our proposed method yields higher classification accuracy than using all features. The positive effect of a small number of features is clearly visible in the accuracy results shown in
            Results without FS  Results with FS
Dataset     F      ACC          Criterion  GAkMEANS  GA-SEF  GA-SSEF
USPS        256    0.7857       Fs         70        112     79
Isolet      617    0.7942       Fs         150       153     103
COIL20      1024   0.8229       Fs         70        159     102
Lung        3312   0.8571       Fs         196       300     200
Orlraws10P  10304  0.95         Fs         398       1681    2031
            % Improvement of GAkMEANS over GA-SEF                   % Improvement of GAkMEANS over GA-SSEF
Dataset     Reduction #features  Improvement in ACC  Decrease in time  Reduction #features  Improvement in ACC  Decrease in time
USPS        37.50%               5.24%               −0.22             11.39%               14.12%              −1.10
Isolet      1.96%                13.86%              0.82              −45.63%              24.40%              −2.71
COIL20      55.97%               4.55%               0.97              31.37%               8.14%               −1.05
Lung        34.67%               1.20%               0.98              2.00%                1.37%               −1.90
Orlraws10P  76.32%               2.00%               0.97              80.40%               1.00%               −8.32
When compared with the genetic algorithm using the Sammon Error Function, GAkMEANS on average is able to reduce the size of the feature subset by 41.29% and improve the accuracy by 5.37%, with the time reduced by 70.71%. When compared with the genetic algorithm using the Sum of Squared Error Function, GAkMEANS on average is able to reduce the size of the subset of selected features by 15.91% and improve the accuracy by 9.81%, with the time increased by a factor of 3. Based on these results, we observed that our proposed algorithm consumes less time than the genetic algorithm using the Sammon Error Function, but more time than the one using the Sum of Squared Error Function. According to the results of applying these different algorithms, we found that the proposed algorithm outperforms the comparative methods by selecting a small number of features that provide superior classification accuracy in a reasonable period of time.
The proposed GAkMEANS gave better performance in terms of classification accuracy and time over all datasets. Our algorithm achieves a higher level of dimensionality reduction by selecting fewer features than the other methods. The superior performance of GAkMEANS can be attributed to several factors. A new fitness function is designed to guide the search toward promising regions containing relevant solutions. Furthermore, new smart crossover and mutation operators are used instead of classical operators to increase the diversity of the population with greater fitness and to provide a mechanism for escaping from local optima. Unlike other genetic algorithms, which generate the initial population randomly, in the GAkMEANS algorithm we developed a procedure to carefully generate the initial individuals; each individual groups in a cluster a subset of similar features, which is intended to improve the search and to converge quickly to a satisfactory near-optimal solution.
In this section, we conducted experiments to compare the performance of GAkMEANS with state-of-the-art ML-based unsupervised feature selection methods. Four ML methods are compared with the proposed algorithm; similarity-based methods: Laplacian Score (LS) and SPEC, and sparse-learning-based methods: MCFS and NDFS. For each dataset, we set the number of selected features uniformly across all methods, according to the number of features selected by GAkMEANS in
Dataset  Algorithm  Accuracy  Time 

USPS (Number of selected features = 70)  GAkMEANS  0.8107  2244.56 s
Laplacian Score  0.4466  13.162 s  
SPEC  0.8083  174.732 s  
MCFS  0.7482  157.823 s  
NDFS  0.8052  4023.61 s  
Isolet (Number of selected features = 150)  GAkMEANS  0.8306  1883.44 s 
Laplacian Score  0.6032  0.537 s  
SPEC  0.7462  6.869 s  
MCFS  0.6089  2.485 s  
NDFS  0.7365  7.215 s  
COIL20 (Number of selected features = 70)  GAkMEANS  0.8539  737.316 s 
Laplacian Score  0.6472  0.287 s  
SPEC  0.6194  10.320 s  
MCFS  0.7090  2.070 s  
NDFS  0.74375  8.796 s  
Lung (Number of selected features = 196)  GAkMEANS  0.9173  706.095 s 
Laplacian Score  0.9162  0.062 s  
SPEC  0.9064  0.301 s  
MCFS  0.9171  0.546 s  
NDFS  0.8817  48.196 s  
Orlraws10P (Number of selected features = 398)  GAkMEANS  0.96  4331.49 s 
Laplacian Score  0.95  0.0311 s  
SPEC  0.90  0.3472 s  
MCFS  0.95  3.3068 s  
NDFS  0.94  837.561 s 
Based on
Based on the results, we found that GAkMEANS can increase the accuracy by 13.67% on average, with an 88.76% average increase in time. We observed that, for the Lung and Orlraws10P datasets, where the number of features is challenging and greater than the number of samples, the performance of GAkMEANS is better than the other algorithms, but the improvement rate is small. On the other hand, for USPS, Isolet, and COIL20, where the number of features is smaller than in Lung and Orlraws10P and less than the number of samples, GAkMEANS also performs better than the other algorithms.
In this paper, a new genetic algorithm combined with the k-Means algorithm for unsupervised feature selection is proposed to find a low-dimensional subset of features. The proposed method is evaluated using a set of datasets varying in dimensionality. The experiments demonstrate the effectiveness of the proposed method: the number of features is reduced, and the classification accuracy is better than using all features. It also outperforms the other comparative methods by obtaining better classification accuracy with a small number of selected features on five common datasets. GAkMEANS can significantly reduce the size of the subset of selected features by 86.35% on average, which leads to an increase in accuracy of 3.78% on average compared to using all features. When compared with the genetic algorithm using the Sammon Error Function, GAkMEANS on average can reduce the size of the subset of selected features by 41.29% and improve the accuracy by 5.37%, with the time reduced by 70.71%. When compared with the genetic algorithm using the Sum of Squared Error Function, GAkMEANS on average can reduce the size of the subset of selected features by 15.91% and improve the accuracy by 9.81%, with the time increased by a factor of 3. When compared with the machine-learning-based methods, we observed that GAkMEANS is able to increase the accuracy by 13.67% on average, with an 88.76% average increase in time. Finally, according to this study, we can conclude that carefully designed genetic algorithms can lead to competitive techniques for solving the UFS problem.
The authors extend their appreciation to the Deanship of Scientific Research at Imam Mohammad bin Saud Islamic University for funding and supporting this work through Graduate Students Research Support Program.
The authors received no specific funding for this study.
The authors declare that they have no conflicts of interest to report regarding the present study.