In most scientific research, feature selection is a challenge for researchers. Selecting all available features is not an option, as it usually complicates the research and leads to performance degradation when dealing with large datasets. On the other hand, ignoring some features can compromise accuracy. Rough set theory offers a sound technique for identifying redundant features that can be dismissed without losing any valuable information; however, exploring all possible combinations of features is an NP-hard problem. In this research we propose adopting a heuristic algorithm to solve this problem: Polar Bear Optimization (PBO), a metaheuristic that provides an effective technique for solving this kind of optimization problem. Unlike other heuristic algorithms, it offers a dynamic birth-and-death mechanism that keeps investing in promising solutions while dismissing hopeless ones. To evaluate its efficiency, we applied the proposed model to several datasets and measured the quality of the obtained minimal feature set to verify that redundant data was removed without loss of information.

Researchers usually start by preparing data before analyzing it to discover hidden rules and insights. Before this process begins, especially with huge datasets, only the features (attributes) related to the research should be considered. However, deciding whether a feature is necessary is not intuitive, especially when the research is conducted by someone outside the domain. For example, when a computer scientist or mathematician works with a medical dataset, it will not be obvious which features are really needed to make a decision and which are not. The features required to make that decision are called the "minimal reduct", while the remaining features are called "redundant features". Redundant and unnecessary features negatively impact performance in terms of data-processing execution time, in addition to inefficient utilization of memory and storage resources. They can also mislead the machine learning process and result in invalid rules and decisions.

When dealing with datasets that have a high number of attributes, a technique to eliminate the unnecessary ones becomes a must. In other words, when a dataset has too many attributes, we need a technique to calculate the importance of each one. The result should not merely state whether an attribute is necessary; there may be a gray area in which we need figures showing the effect of removing an attribute on the quality of the dataset. Here, depending on the characteristics of each case, the researcher can decide how much tolerance to accept. In some cases, in order to reduce the number of selected attributes, the decision might allow, for example, 10% tolerance in the quality of the dataset, while in other cases no tolerance is acceptable; then only attributes with zero impact on the quality of the dataset can be dismissed.

Pawlak in 1982 [

Polar Bear Optimization PBO, proposed by Połap et al. [

Due to the importance of reducing the number of attributes within an affordable processing time, many studies have implemented rough set theory with the support of heuristic algorithms, which allowed finding the minimal reduct without exploring all possible alternatives. Previously implemented heuristic algorithms, more or less, share the same concept: a static population is generated at the beginning, and the found solutions are then improved in each iteration using various movement techniques and fitness functions. The uniqueness of the PBO algorithm is its dynamic death/reproduction technique, which gives an additional chance to good solutions by generating new solutions from promising ones while eliminating solutions that are not progressing well.

A considerable amount of literature has focused either on using rough set techniques to find an optimal reduct in various areas, or on utilizing heuristic algorithms to solve NP-hard problems. However, in the summary below we list only studies that implemented rough sets combined with heuristic techniques:

Chen et al. [

Lazo-Cortés et al. [

The rest of this paper is organized as follows: Section 2 introduces rough set theory and its basic functionality, together with an overview of the PBO algorithm; Section 3 presents how the PBO algorithm was customized to be implemented along with rough sets; Section 4 describes the experimental results; Section 5 concludes this research and lists some open topics for future research.

We will start this section by explaining the basic concepts of rough set theory; then we will go through the common characteristics of heuristic algorithms, focusing on population-based ones; finally, we will describe the new technique introduced in the PBO algorithm.

Introduced by Pawlak [

In rough set theory, an information system (information table) can be defined as a table of rows and columns. Columns are called attributes or features, while rows are called objects (instances). Formally, the information table can be defined as

Information systems can also include one or more decision features. For example, a doctor can give a decision for each patient (object) based on a list of input features. Information systems include several features, and these features have different degrees of influence on the decision. The definition of the decision system can be extended to be
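As a concrete illustration (the attribute names and values below are hypothetical, not taken from any of the benchmark datasets), a small decision table can be represented as a list of objects, each mapping condition features and one decision feature to values:

```python
# Hypothetical decision table: each row (object) maps attributes to values.
# "temperature" and "cough" are condition features; "flu" is the decision.
objects = [
    {"temperature": "high",   "cough": "yes", "flu": "yes"},
    {"temperature": "high",   "cough": "no",  "flu": "yes"},
    {"temperature": "normal", "cough": "yes", "flu": "no"},
    {"temperature": "normal", "cough": "no",  "flu": "no"},
]
condition_features = ["temperature", "cough"]
decision_feature = "flu"
```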

Indiscernibility identifies the objects (instances) that, with regard to certain feature(s), cannot be distinguished from each other. It is simply an equivalence relation between objects. For a feature subset B, the indiscernibility relation can be written as IND(B) = {(x, y) ∈ U × U : a(x) = a(y) for every a ∈ B}.
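A minimal sketch of this relation: grouping objects by their values on a feature subset B yields the equivalence classes of IND(B) (the toy data below is hypothetical):

```python
from collections import defaultdict

def indiscernibility_classes(objects, attributes):
    """Partition the universe U into equivalence classes of IND(B):
    two objects fall in the same class iff they agree on every
    attribute in the subset B."""
    classes = defaultdict(list)
    for i, obj in enumerate(objects):
        key = tuple(obj[a] for a in attributes)
        classes[key].append(i)
    return list(classes.values())

# Hypothetical toy data: objects 0 and 1 are indiscernible w.r.t. B
objects = [
    {"temp": "high",   "cough": "yes"},
    {"temp": "high",   "cough": "yes"},
    {"temp": "normal", "cough": "no"},
]
print(indiscernibility_classes(objects, ["temp", "cough"]))  # → [[0, 1], [2]]
```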

Let B ⊆ C be a subset of the condition features and [x]_B be the equivalence class of an object x ∈ U under IND(B). The approximation of a set of objects X ⊆ U is then given by two sets, the lower and the upper approximation.

The upper approximation of X is defined by: B^*(X) = {x ∈ U : [x]_B ∩ X ≠ ∅}.

The lower approximation of X is defined by: B_*(X) = {x ∈ U : [x]_B ⊆ X}.
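Given the equivalence classes of IND(B), both approximations reduce to simple set tests; a minimal sketch (the classes and the target set X below are hypothetical):

```python
def lower_approximation(classes, X):
    """B-lower approximation: union of equivalence classes fully contained in X."""
    return sorted(x for c in classes if set(c) <= X for x in c)

def upper_approximation(classes, X):
    """B-upper approximation: union of equivalence classes that intersect X."""
    return sorted(x for c in classes if set(c) & X for x in c)

# Hypothetical equivalence classes of IND(B) and a target set X of objects
classes = [[0, 1], [2, 3], [4]]
X = {0, 1, 2}
print(lower_approximation(classes, X))  # → [0, 1]
print(upper_approximation(classes, X))  # → [0, 1, 2, 3]
```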

The accuracy of a rough set is a ratio between 0 and 1. When this value is one, the upper and lower approximations match; in this case the set is no longer rough and is called a "crisp set". On the other hand, as the value decreases, the boundary region grows, which indicates a degree of inconsistency.
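This ratio can be computed directly from the sizes of the two approximations; a minimal sketch (the example sets are hypothetical):

```python
def approximation_accuracy(lower, upper):
    """Accuracy of a rough set: |B_*(X)| / |B^*(X)|.
    Equals 1 exactly when the two approximations coincide (a crisp set)."""
    return len(lower) / len(upper)

# Hypothetical approximations of some target set X
print(approximation_accuracy([0, 1], [0, 1, 2, 3]))  # → 0.5
print(approximation_accuracy([4], [4]))              # → 1.0 (crisp)
```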

The dependency between a set of condition attributes B and a set of decision features R is given by the following formula:

γ_B(R) = |POS_B(R)| / |U|, where POS_B(R), the positive region, is the union of the B-lower approximations of the decision classes of R.

When the value of γ_B(R) equals 1, the decision features R depend totally on the condition features B; when it is less than 1, the dependency is only partial, and when it is 0 there is no dependency at all.
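The dependency degree can be computed without materializing the approximations explicitly: an object belongs to POS_B(R) exactly when its B-equivalence class is pure with respect to the decision. A minimal sketch (the toy data is hypothetical):

```python
from collections import defaultdict

def dependency_degree(objects, B, decision):
    """gamma_B(R) = |POS_B(R)| / |U|: the fraction of objects whose
    B-equivalence class maps to a single decision value (a "pure" class)."""
    classes = defaultdict(list)
    for obj in objects:
        classes[tuple(obj[a] for a in B)].append(obj[decision])
    positive = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return positive / len(objects)

# Hypothetical toy data: the first two objects conflict on the decision
objects = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "no"},
    {"a": 0, "b": 0, "d": "yes"},
]
print(dependency_degree(objects, ["a", "b"], "d"))  # → 0.5
```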

Polar Bear Optimization can be classified as a population-based optimization algorithm; it was proposed by Połap et al. [

Depending on an input variable (the population size), the algorithm starts by creating the population elements and distributing them randomly within the search domain, thereby specifying the number of bears that will be looking for the target solution. This parameter should be selected very carefully: increasing it raises the chance of finding the global optimum, but having too many bears has a negative impact on performance.

All bears move one step in each iteration; however, before moving to the new potential location, the new solution is evaluated. If the new position is better than the current one, the bear moves; otherwise it stays at its current location. Formulas

Once the internal loop (local search) is complete and all population elements have had a chance to improve their locations, the global search starts by selecting one of the best solutions (bears), and all elements try to make a step toward it. However, each step is first evaluated, and it takes place only if it leads to a better solution. The global search is performed according to the formula

Unlike most other swarm-based optimization algorithms, the number of objects in PBO is not fixed. The algorithm starts by generating only 75% of the population; then, after each iteration, a decision is made whether to produce a new member or remove one, based on a randomly generated variable. When reproduction is decided, two of the best bears generate a new solution by combining their solutions, under the assumption that combining two good solutions will produce another good one. When the decision is to remove a bear, the bear with the worst fitness value is removed from the population, after checking that the current number of bears will not fall below 50% of the given population size.
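One iteration of this dynamic-population step might be sketched as follows (the 50/50 breed-or-remove chance and the per-position parent mixing are illustrative assumptions, not the paper's exact formulas):

```python
import random

def adjust_population(bears, fitness, full_size):
    """One dynamic-population step, PBO-style (sketch): either breed a new
    bear from the two best or remove the worst, keeping the population
    between 50% and 100% of full_size."""
    bears.sort(key=fitness, reverse=True)  # best-first
    if random.random() < 0.5 and len(bears) < full_size:
        best1, best2 = bears[0], bears[1]
        # Combine the two best solutions position by position.
        child = [random.choice(pair) for pair in zip(best1, best2)]
        bears.append(child)
    elif len(bears) > full_size // 2:
        bears.pop()  # drop the worst bear
    return bears
```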

These characteristics of the PBO algorithm help find the "optimal" solution using fewer objects (bears) and relatively fewer loops. We put the word "optimal" in quotation marks because there is no guarantee that the found solution is actually optimal; this limitation is common to all heuristic algorithms. Several studies have investigated the optimal population size and required number of loops [

In the previous section we explained how rough set techniques can be used to assess the quality of a subset of attributes for a given dataset, and we described the basic concepts of the PBO algorithm. In this section we explain our proposed approach for finding the minimal subset of features, with the support of the PBO algorithm, without exploring all possible combinations.

The polar bear algorithm as proposed in [

In

According to the above, PBO cannot be applied as-is to feature selection problems because the two algorithms speak different languages: PBO moves bears through spatial coordinates, while rough set solutions are binary selections of features. The following sub-sections present our changes to the original PBO to make it compatible with RST-related problems:

In order to make a step as a local search, we will be flipping over

In Formula
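Such a binary local-search step can be sketched as follows (assuming, for illustration, that a single randomly chosen bit is flipped and the move is kept only when it strictly improves the solution; the number of flipped bits follows the formula above):

```python
import random

def local_step(bear, quality):
    """One binary local-search step (sketch): flip one randomly chosen bit
    (select or deselect one feature) and keep the move only if it yields a
    strictly better solution; otherwise the bear stays where it is."""
    candidate = bear[:]
    i = random.randrange(len(candidate))
    candidate[i] = 1 - candidate[i]  # flip the selected feature bit
    return candidate if quality(candidate) > quality(bear) else bear
```

Here `quality` is whatever fitness function scores a feature subset, e.g. one combining the dependency degree with the number of selected features.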

In each loop, after all bears have made a step in the local search, we propose giving one more chance to one of the best bears. First, we randomly select one of the top 10% of bears and apply the local search again (

Once the local search is finished and all bears have tried to enhance their locations, two of the best bears are randomly selected in order to produce a new solution. Here we have applied two approaches (
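The two reproduction approaches based on bitwise AND/OR operators (mentioned again in the conclusion) can be sketched as:

```python
def reproduce_and(parent1, parent2):
    """Child keeps only the features selected in BOTH parents (bitwise AND),
    biasing the new solution toward smaller feature subsets."""
    return [a & b for a, b in zip(parent1, parent2)]

def reproduce_or(parent1, parent2):
    """Child keeps the features selected in EITHER parent (bitwise OR),
    biasing the new solution toward preserving dependency."""
    return [a | b for a, b in zip(parent1, parent2)]

p1 = [1, 0, 1, 0]
p2 = [1, 1, 0, 0]
print(reproduce_and(p1, p2))  # → [1, 0, 0, 0]
print(reproduce_or(p1, p2))   # → [1, 1, 1, 0]
```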

The dependency will be calculated according to formula

The first part of the equation

Our main proposed algorithm along with sub-algorithms is represented in

We have selected eight benchmark datasets from the UCI Machine Learning Repository [

The algorithm needs four input parameters: the population size, the number of iterations, and two tolerance parameters. The population size and the number of iterations should depend on the size of the dataset, especially the number of features, since the total number of possible solutions grows as the number of features increases. Accordingly, and based on our experimental analysis, we noticed that the optimal population size equals the number of attributes. The same logic applies to the number of iterations: more features require more trials to find the optimal solution, and we found that twice the number of attributes is a suitable iteration count. We did not see any need to adjust these parameters according to the number of instances in the dataset.

The

| No | Dataset | Samples | Features | Classes |
|---|---|---|---|---|
| 1 | Audiology | 200 | 69 | 24 |
| 2 | Balance | 625 | 4 | 3 |
| 3 | Chess | 3196 | 36 | 2 |
| 4 | Lung | 32 | 56 | 3 |
| 5 | Mushroom | 8124 | 22 | 2 |
| 6 | Soylarge | 307 | 35 | 19 |
| 7 | Soysmall | 47 | 35 | 19 |
| 8 | Vote | 435 | 16 | 2 |

| Dataset | Instances | Features | Min | RSAR | Time | EBR | Time | FSARSR | Time | BPBO | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Audiology | 200 | 69 | - | - | - | - | - | 13 | 4765.51 | 14 | 14 |
| Balance | 625 | 4 | 4 | 4 | 0.492 | 4 | 0.489 | 4 | 1.98 | 4 | <1 |
| Chess | 3196 | 36 | - | 31 | 766.52 | 33 | 547.13 | 29 | 3343 | 21 | 275 |
| Lung | 32 | 56 | 4 | 5 | 1.22 | 4 | 0.64 | 4 | 128.56 | 4 | 1 |
| Mushroom | 8124 | 22 | 4 | 5 | 286.27 | 5 | 225.839 | 4 | 2767.12 | 4 | 47 |
| Soylarge | 307 | 35 | 10 | 13 | 23.904 | 10 | 12.21 | 10 | 573.31 | 9 | 6 |
| Soysmall | 47 | 35 | 2 | 4 | 0.24 | 2 | 0.40 | 2 | 6.89 | 2 | <1 |
| Vote | 435 | 16 | 9 | 10 | 1.73 | 13 | 1.28 | 9 | 45.32 | 8 | <1 |

From

From

As an input parameter, we had to decide the required population size for each dataset. Several population sizes were tried across all datasets, and we noticed that in most cases the optimal population size is almost equal to the number of attributes. Note that selecting the population size correctly has a major impact on performance: a very small value might prevent the algorithm from finding good results, especially when the number of attributes is high, while a very large value might find good solutions but will negatively affect the execution time. In

One more important factor is the iteration count. This parameter also affects both the possibility of finding the optimal solution and the total runtime. Similar to the population size, it should be selected carefully; according to our analysis, and as we can see in

The main contribution of this research is the dynamic population and reproduction/death strategy. To demonstrate it, we recorded additional information for each generated solution, including the iteration that produced it and the number of local-search steps executed until the optimal solution was reached. From

| Dataset | Samples | Features | Iterations | Best solution | Changed | Duration (s) |
|---|---|---|---|---|---|---|
| Audiology | 200 | 69 | 140 | 107 | 111 | 12 |
| Balance | 625 | 4 | 10 | 1 | 1 | <1 |
| Chess | 3196 | 36 | 74 | 56 | 68 | 220 |
| Lung | 32 | 56 | 114 | 68 | 78 | 1 |
| Mushroom | 8124 | 22 | 46 | 28 | 29 | 46 |
| Soylarge | 307 | 35 | 72 | 42 | 70 | 12 |
| Soysmall | 47 | 35 | 72 | 21 | 29 | 1 |
| Vote | 435 | 16 | 34 | 11 | 33 | 1 |

In this research we discussed the importance of reducing the size of a dataset before starting any analysis, and how rough set theory provides a powerful technique for finding the minimal reduct of a dataset. We also explained how rough sets alone might not be able to find the minimal reduct, as this could require evaluating all combinations of attributes, which is not feasible for large datasets. Heuristic algorithms, especially population-based ones, can play a vital role in solving such NP-hard problems, and in the literature several heuristic algorithms have been utilized together with rough set techniques to find the minimal reduct. We proposed a binary representation of the Polar Bear Optimization algorithm to find the optimal reduct of a dataset. The original polar bear algorithm can only deal with solutions represented as spatial coordinates, while solutions in the rough set setting are binary arrays of selected and unselected features, so we had to make some amendments to the original functions of PBO to make it compatible with rough set terminology. We first represented the objects (bears) in binary format, then modified the local and global search functions to use binary operators. To evaluate the proposed algorithm, we selected several datasets from UCI and compared our results with similar algorithms. Our experimental analysis showed that the dynamic population behavior of the proposed algorithm allowed finding the minimal reduct very efficiently compared with similar algorithms.

In this research we implemented AND/OR binary operators to reproduce new solutions from two good solutions after each iteration. The results were very good and showed that this is a promising technique; however, we believe that more advanced binary operators are worth evaluating and might give even better results, which could be a subject for future research.