Due to the advancements in information technologies, massive quantity of data is being produced by social media, smartphones, and sensor devices. The investigation of data stream by the use of machine learning (ML) approaches to address regression, prediction, and classification problems have received considerable interest. At the same time, the detection of anomalies or outliers and feature selection (FS) processes becomes important. This study develops an outlier detection with feature selection technique for streaming data classification, named ODFST-SDC technique. Initially, streaming data is pre-processed in two ways namely categorical encoding and null value removal. In addition, Local Correlation Integral (LOCI) is used which is significant in the detection and removal of outliers. Besides, red deer algorithm (RDA) based FS approach is employed to derive an optimal subset of features. Finally, kernel extreme learning machine (KELM) classifier is used for streaming data classification. The design of LOCI based outlier detection and RDA based FS shows the novelty of the work. In order to assess the classification outcomes of the ODFST-SDC technique, a series of simulations were performed using three benchmark datasets. The experimental results reported the promising outcomes of the ODFST-SDC technique over the recent approaches.

With the advancement in information technologies, huge volume of information is produced via sensor devices, social networks, and mobile phones. Streaming data is a series of information that arrives at the system in a changing and continuous manner. Streaming data has certain features like timely ordered, huge, potentially infinite, and rapidly changing in length [

Recently, anomaly detection in real-time computer networks become increasingly difficult task because of the continuous increase of high dimensional, high-volume, and high-speed data streams, where ground truth data is unavailable [

Since none of the earlier works have focused on the outlier detection with feature selection (FS) process for streaming data classification, this study develops an outlier detection with FS technique for streaming data classification, named ODFST-SDC technique. Primarily, streaming data is pre-processed in two ways namely categorical encoding and null value removal. In addition, Local Correlation Integral (LOCI) is used which is significant in the detection and removal of outliers. Besides, red deer algorithm (RDA) based FS approach is employed to derive an optimal subset of features. Finally, kernel extreme learning machine (KELM) classifier is used for streaming data classification. The design of LOCI based outlier detection and RDA based FS shows the novelty of the work. For assessing the enhanced streamlining data classification results of the ODFST-SDC technique, a wide range of simulations were carried out against three benchmark datasets.

Benjelloun et al. [

Heigl et al. [

Gu et al. [

In this study, a new ODFST-SDC model has been presented for streaming data classification. The proposed ODFST-SDC technique incorporates different processes namely pre-processing, LOCI based outlier removal, RDA based FS, and KELM based classification.

At the initial stage, the streaming data is pre-processed in such way that the categorical encoding and null value removal process is carried out. Secondly, the RDA is applied to indicate an optimal subset of features and finally, KELM model is utilized for streaming data classification.

During the outlier removal process, the pre-processed streaming data is passed into the LOCI model which effectively detects and removes outliers. LOCI calculates a counting neighborhood to nn nearest observation, whereas the radius has equivalent to outermost observations. In the counting neighborhood, all observations have a sampling neighborhood of that the size has defined as the alpha input parameter. LOCI returns an outlier score dependent upon the standard deviation of sampling neighbourhood is named as multi-granularity deviation factor (MDEF). The LOCI operation is helpful to outlier detection from clusters and other multi-dimensional fields [

Large sampling neighborhood: To all points and counting radius, the sampling neighborhood was chosen that huge sufficient to comprise sufficient samples. It can be select α = 1/2 from every exact computation, and it is classically utilized α = 1/16 from LOCI to robustness

Full-scale: The MDEF value is studied to a wide range of sampling radii. In other terms, the maximal sampling radius is rmax ≈ α^{−1}RP (that equals to maximal counting radius of RP). The minimal sampling radius

Standard deviation based flagging: The point can be denoted as outlier when some

Once the outliers are removed, the next step is to derive an optimum feature subset using RDA. The RDA is similar to other meta-heuristics, starts with an arbitrary population that is red deers (RDs) counterpart. An optimum RDs from the population is selected and dubbed the “male Red Deers (MRD)”, but the remainder was dubbed “hinds” [

Afterward, the fitness of all the members from the population is computed based on

MRDs are attempt for boosting their grace by roaring under this stage. Thus the outcome, the roaring procedure can succeed/fail. When the main function of neighbors is MRD, it can be exchanged by the forecasting when it can be superior to the preceding MRD. The subsequent data is offered for keeping males up to speed on its status. The subsequent formula is presented:

where

The fighting procedure is modeled by the succeeding 2 mathematical equations:

The 2 novel solutions established by fighting procedure are

where

The count of hinds of a harem is computed as:

where

The count of hinds from the

It can be picked a harem at arbitrary (call it k) and allows the male commander for mating with

where

where

The RDA aims to choose a set of features from any dataset with the size of

where

At the final classification process, the chosen features are applied to the KELM model to classify streaming data. Extreme learning machine (ELM) is established as a general classification model which needs minimum human interference [

where

where

The resultant weight vector of ELM is evaluated by minimizing the function provided as:

relating to

where

whereas

In this section, the experimental validation of the ODFST-SDC model is carried out using three benchmark datasets namely CICIDS 2018 (

In this study, the CICIDS 2018 dataset with three classes (with 1000 instances each) is employed namely benign, FTP-Brute Foreca, and SSH-Brute force. After outlier removal process, the number of instances under benign, FTP-Brute Foreca, and SSH-Brute force classes becomes 992, 874, and 987. Besides, the ODFST-SDC model has offered 21 chosen features out of 80 features. They are Dst Port, Protocol, Flow Duration, Tot Fwd Pkts, Tot Bwd Pkts, TotLen Fwd Pkts, TotLen Bwd Pkts, Fwd Pkt Len Max, Fwd Pkt Len Min, Fwd Pkt Len Mean, Fwd Pkt Len Std, Bwd Pkt Len Max, Bwd Pkt Len Min, Bwd Pkt Len Mean, Bwd Pkt Len Std, Flow Byts/s, Flow Pkts/s, Flow IAT Mean, Flow IAT Std, Flow IAT Max, Flow IAT Min.

Besides, the KDD-Cup 1999 dataset holds samples under four classes namely normal, denial of service (DoS), Probe, and root to local (R2L). Each class label comprises a total of 750 instances. Once the outliers are removed, the number of instances becomes 638, 575, 748, and 720 namely normal, DoS, Probe, and R2L classes respectively. Besides, the number of chosen features is 5 out of 41 and they are protocol_type, service, flag, src_bytes, dst_bytes, and land. Finally, the NSL-KDD dataset also includes four classes namely normal, DoS, Probe, and R2L. Each class label comprises a total of 750 instances. After outlier removal process, the number of instances under normal, DoS, probe, and R2L classes becomes 624, 671, 540, and 440 respectively. In addition, a set of 17 features are elected out of 42 features. They are protocol_type, service, flag, src_bytes, dst_bytes, land, wrong_fragment, urgent, hot, num_failed_logins, logged_in, num_compromised, root_shell, su_attempted, num_root, num_file_creations, num_shells.

The confusion matrices created by the ODFST-SDC model under three distinct dataset is portrayed in

A comprehensive precision-recall curve inspection of the ODFST-SDC model under three distinct datasets is portrayed in

Measures | Applied Dataset | ||
---|---|---|---|

CICIDS | KDDCup 99 | NSL-KDD | |

Accuracy | 95.27 | 98.99 | 94.20 |

Precision | 95.99 | 99.03 | 94.59 |

Recall | 95.34 | 99.02 | 94.67 |

F1-Score | 95.42 | 99.02 | 94.51 |

ROC Score | 99.09 | 99.79 | 98.40 |

Methods | Accuracy (%) |
---|---|

ODFST-SDC | 95.27 |

Gaussian model | 91.06 |

HID-DNN model | 92.03 |

NID-GB model | 84.25 |

CS-PSO model | 75.51 |

Fuzzy C-means | 95.19 |

Methods | Accuracy (%) |
---|---|

ODFST-SDC | 98.99 |

CIDD-ADODNN model | 95.90 |

OC-SVM model | 91.80 |

NB model | 89.60 |

Gaussian model | 91.10 |

DNN-SVM model | 92.00 |

Methods | Accuracy (%) |
---|---|

ODFST-SDC | 94.20 |

SVC-KPCA model | 93.40 |

MOPF model | 91.74 |

DLANID-TAL model | 89.22 |

DLANID-FAL model | 85.42 |

AMGA2-NB model | 94.00 |

Along with that, the MOPF and SVC-KPCA models have portrayed certainly enhanced

In this study, a novel ODFST-SDC approach was developed for streaming data classification. The proposed ODFST-SDC technique incorporates different processes namely pre-processing, LOCI based outlier detection, RDA based FS, and KELM based classification. Firstly, the streaming data is pre-processed in such way that the categorical encoding and null value removal process is carried out. Next, the RDA is applied to choose an optimal subset of features and finally, KELM method is utilized for streaming data classification. The application of LOCI based outlier removal and RDA based election of features help to considerably improve the streaming data classification outcomes. For assessing the enhanced streamlining data classification results of the ODFST-SDC technique, a wide range of simulations were carried out against three benchmark datasets. The experimental results reported the promising outcomes of the ODFST-SDC technique over the recent approaches with maximum accuracy of 94.20% on NSL-KDD dataset. In future, concept drift problem can be handled in the streaming data classification process.