Botnets, distributed platforms of remotely controlled compromised machines, have been extensively analyzed by various researchers. However, certain limitations still need to be addressed efficiently. Providing a detection mechanism with learning approaches offers a broader solution by satisfying multi-objective constraints. The bots' patterns, or features, over the network have to be analyzed in both a linear and a non-linear manner. The linear and non-linear features are composed of high-level and low-level features. The collected features are maintained in a Bag of Features (BoF), from which the most influential features are selected and provided to the classifier model. Here, the linearity and non-linearity of the threat are evaluated with a Support Vector Machine (SVM). Next, redundant features are eliminated from the collected BoF, as they impose overhead on the predictor model. Finally, a novel incoming-data Redundancy Elimination-based Learning model (RedE-L) is built to classify the network features and provide robustness in botnet detection. The simulation is carried out in the MATLAB environment, and the proposed RedE-L model is evaluated on various online accessible network traffic datasets (benchmark datasets). The proposed model shows a better trade-off compared to existing approaches such as conventional SVM, C4.5, and RepTree. Various metrics such as accuracy, detection rate, Matthews Correlation Coefficient (MCC), and other statistical analyses are reported to show the proposed RedE-L model's reliability: the F1-measure is 99.98%, precision is 99.93%, accuracy is 99.84%, TPR is 99.92%, TNR is 99.94%, FNR is 0.06, and FPR is 0.06.

The systems that are compromised over the network are termed botnets or zombies. Botnets are attack platforms that are distributed in nature [

Recently, botnet detection has become one of the most active areas of research. Some prevailing approaches to botnet detection rely on statistical feature computation over packets and traffic flows, whereas in other cases packet inspection is performed [

However, these approaches do not address the network patterns in a linear and non-linear manner. The network patterns are composed of higher-level and lower-level features; in some cases the high-level features are the most influential ones that trigger attack activities over the network. The lower-level features are not as severe and do not influence network communication. The linearity and non-linearity of the data patterns are analyzed efficiently by a Support Vector Machine (SVM). The data patterns are validated by the SVM and maintained in a Bag of Features (BoF). This process reduces the redundancy of the data patterns to handle the computational overhead efficiently. Redundancy elimination is carried out by the Redundancy Elimination-based Learning model (RedE-L), which categorizes the features for predicting botnets. The adoption of this model reduces the theoretical complexity while maintaining prediction accuracy. The proposed model is evaluated on a benchmark dataset to show its efficiency, and is compared against conventional SVM, C4.5, RepTree, and others, where metrics such as MCC, accuracy, and detection rate are computed to show the model's reliability. The major research challenge is the prediction over the botnet dataset (benchmark dataset) and the analysis of botnet features.

The rest of the work is organized as follows: Section 2 presents an extensive review of the background studies on botnet detection; Section 3 discusses the proposed methodology and the functionality of the model; Section 4 presents the numerical results and discussion of the proposed model; Section 5 concludes and outlines future research directions.

The significance of predicting bots over the network has been extensively reviewed using conventional approaches. Here, the relationship between the host and the bot, and its consequences, are analyzed. Saad et al. [

Garcia et al. [

Some statistical features of packets and flows are analyzed for detecting C2 (command-and-control) channels by the authors of [

This research work concentrates on providing an efficient approach to analyse the incoming data patterns over a distributed environment. The data patterns are either linear or non-linear, and are handled by a support vector machine with regression analysis. Redundancy elimination removes unnecessary features that lead to over-fitting, and thereby improves generalization.

It is a botnet traffic dataset acquired from CTU University in 2011. The goal of this dataset is to capture real botnet traffic mixed with background and normal traffic. It is composed of 13 captures (scenarios) of various botnet samples. In these scenarios, specific malware is executed using several protocols and actions. Each scenario is captured in a pcap file composed of packets of three different traffic types. These files are processed to obtain information such as weblogs, NetFlows, and so on. The uni-directional NetFlows specify the traffic and assign the labels. The bi-directional NetFlows offer various advantages: they show the malware used and the number of infected computers in each scenario.

The data collected from the online source needs to be normalized before performing the training and testing process. The linear (min-max) transformation of the host resource utilization is expressed as

x'_i = (x_i − x_min) / (x_max − x_min)

where x_i is the original data value observed during host resource utilization, and x_min and x_max are its minimal and maximal values, respectively.
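As a minimal sketch of this normalization step, the transformation above can be written in plain Python (the function name is illustrative, not from the paper):

```python
def min_max_normalize(values):
    """Linear (min-max) transformation of host resource utilization values
    into the [0, 1] range: x'_i = (x_i - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:               # constant feature: map everything to 0
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]
```

For example, `min_max_normalize([2, 4, 6])` yields `[0.0, 0.5, 1.0]`, so every feature contributes on a comparable scale during training.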

SVM is an efficient classification approach that has gained huge popularity for classification problems. However, the analysis of a large dataset involves both linear and non-linear data (dependent and independent variables). Therefore, these kinds of data need to be analyzed to identify the data patterns that trigger attack activities over the network. Regression analysis is adopted in this model to measure the linearity of the data. The training points are constructed as pairs (x_i, y_i). The input vectors x_i are formed from finite sequential measurements, and the output values y_1, y_2, …, y_i are split into training and testing outputs. The training data is given as {(x_i, y_i)}, with input data x_i ∈ R^n and output data y_i ∈ R.

Here, the regression function takes the standard linear form f(x) = w·x + b, where 'w' is the weight vector and 'b' is the bias term.

Here, R_emp denotes the empirical risk. The data-pattern prediction functionality relies on hyper-parameters such as the regularization constant 'C' and the tube width 'ε'.

Here, the slack variables ξ_i and ξ_i* measure the deviation of training samples that fall outside the ε-insensitive tube.
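The role of the ε-tube and the slack variables can be illustrated with the standard ε-insensitive loss used in support vector regression; this is a generic sketch of the textbook loss, not code from the paper:

```python
def epsilon_insensitive_loss(y_true, y_pred, eps=0.1):
    """SVR's epsilon-insensitive loss: deviations inside the eps-tube cost
    nothing; beyond the tube the cost grows linearly, which is exactly what
    the slack variables measure."""
    return max(0.0, abs(y_true - y_pred) - eps)
```

A prediction within ε of the target incurs zero loss (zero slack); a prediction 0.3 away with ε = 0.1 incurs a slack of 0.2.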

Linear regression is used to assign the class labels to all samples based on the class-label score. X_2 is a sample matrix whose i-th row corresponds to sample x_i, which is composed of the weighted patterns from the SVM and augmented as x_i = [1, s_{i1}, s_{i2}, …, s_{ic}]^T. With the regression coefficient matrix θ_1, the features (data patterns) satisfy Y'' = X_2 θ_1, where Y'' is the class-label matrix and θ_1 is the regression coefficient matrix.
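The score computation Y'' = X_2 θ_1 is a plain matrix product; a minimal pure-Python sketch (names are illustrative) is:

```python
def score_labels(X2, theta1):
    """Compute the class-label score matrix Y'' = X2 . theta1, where each row
    of X2 is an augmented sample [1, s_i1, ..., s_ic] and theta1 holds one
    coefficient column per class label."""
    rows, cols = len(X2), len(theta1[0])
    return [[sum(X2[i][k] * theta1[k][j] for k in range(len(theta1)))
             for j in range(cols)]
            for i in range(rows)]
```

For instance, with X2 = [[1, 2], [1, 3]] (bias term plus one pattern score) and theta1 = [[0.5], [1.0]], the label scores are [[2.5], [3.5]].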

For any unknown sample, an augmented score vector is formed, and the outputs for all labels are computed using the columns of the coefficient matrix θ_1. The relationships among the labels are considered during the assignment of class labels to samples. The data patterns and the corresponding labels are evaluated to allocate the class labels independently against a threshold level. Assigning class labels to the provided samples based on the evaluation of the data patterns gives optimal results, i.e., the output of a label is determined not only by its label score but also by other corresponding factors. The experimentation carried out with this process shows the label correlation towards the given input in a regressive manner.

The attributes of the dataset are translated to real numbers, as this model operates on numerical data. Kernel function selection is performed to handle the regression problem; the kernel parameters produce diverse outcomes in accuracy and prediction performance. The kernel function maps the original data (both its linear and non-linear parts) to a higher-dimensional space in which it becomes linear. There are four common kernel functions: the Radial Basis Function (RBF), the linear function, the polynomial function, and the sigmoid function. This work considers only the linear kernel, which is expressed as k(x_i, x_j) = x_i · x_j.
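The linear kernel is just the inner product of the two feature vectors; a one-line sketch makes this concrete (the function name is illustrative):

```python
def linear_kernel(x, z):
    """Linear kernel: k(x, z) = x . z, the inner product of two feature
    vectors; this is the special case used for the linear data patterns."""
    return sum(a * b for a, b in zip(x, z))
```

For example, `linear_kernel([1, 2], [3, 4])` returns 11.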

The linear function is considered a special case, determined to be an optimal parameter adjustment during the computation process. Non-linearity, however, is measured using Kernel Principal Component Analysis (kernel PCA). Here, the original linear PCA is extended to non-linear data distributions. In kernel PCA, the input data is non-linearly mapped into a high-dimensional feature space. The mapping function is expressed as φ: x → φ(x).

The mapping function φ is never evaluated explicitly; only the kernel values k(x_i, x_j) are computed. The kernel used for PCA is kept consistent with the kernel computation. Here, the kernel function is the Radial Basis Function (RBF). The samples are mapped through the kernel matrix, where the element in the i-th row and j-th column is expressed as K_ij = k(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)).
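Building the RBF kernel matrix from a set of samples can be sketched in a few lines of plain Python (function names are illustrative, not from the paper):

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: k(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def kernel_matrix(samples, sigma=1.0):
    """K[i][j] = k(x_i, x_j); the samples are mapped only implicitly,
    so the feature map phi is never computed."""
    return [[rbf_kernel(xi, xj, sigma) for xj in samples] for xi in samples]
```

Note that the diagonal is always 1 (a sample compared with itself) and the matrix is symmetric, both standard properties of a valid kernel matrix.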

It is explicitly impossible to analyse the feature space, as the mapping function is unknown.

Here, the centralized version of the mapped samples is obtained by centering the kernel matrix in the feature space.
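Centering in feature space can be done entirely on the kernel matrix via the standard formula Kc = K − 1_n K − K 1_n + 1_n K 1_n, where 1_n is the n×n matrix with all entries 1/n. A minimal sketch under that standard formulation (not code from the paper) is:

```python
def center_kernel(K):
    """Center the kernel matrix in feature space:
    Kc = K - 1n.K - K.1n + 1n.K.1n, with 1n the n x n matrix of 1/n.
    Equivalent to subtracting row means, column means, and adding the
    grand mean."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + grand_mean
             for j in range(n)] for i in range(n)]
```

After centering, every row (and column) of Kc sums to zero, which corresponds to the mapped samples having zero mean in feature space.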

The mapped data is unknown and cannot be computed explicitly. Instead, the data is projected directly using the kernel trick. The data is mapped to obtain the features (data patterns), and the kernel between the mapped training samples and the mapped incoming new data is computed in the same way.

Generally, lower-level features carry less information, while higher-level features hold some semantic information. This gives researchers extensive insight into the feature selection process.

Here, linear/non-linear Support Vector Machine (SVM) classifiers learn a model for predicting the traffic class from the BoF based on kernel representations [

The Redundancy Elimination-based Learning model (RedE-L) is designed to eliminate the redundancy identified during feature (data pattern) analysis. This is performed in both a linear and a non-linear way, to reduce the computational overhead incurred while tracing botnets. The flow diagram of the proposed model is shown in

It is a backward elimination process adopted for the SVM (linear and non-linear data). Here, the first elimination is performed over the indexed feature matrix, and the same process is applied to all the features. Then, the SVM is re-trained after the elimination of features.
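For a linear SVM, backward elimination is commonly driven by the squared weights w_j²: the feature whose weight contributes least to the margin is dropped first. The sketch below illustrates that ranking criterion with a fixed weight vector; a full SVM-RFE would re-train the SVM after each drop, which is omitted here for brevity (names are illustrative):

```python
def rfe_elimination_order(weights, names):
    """Backward-elimination sketch for a linear SVM: repeatedly remove the
    feature with the smallest squared weight w_j^2 (the one whose removal
    changes the margin the least). Returns features in elimination order,
    least important first."""
    remaining = list(zip(names, weights))
    eliminated = []
    while remaining:
        worst = min(remaining, key=lambda nw: nw[1] ** 2)
        remaining.remove(worst)
        eliminated.append(worst[0])
    return eliminated
```

With weights [0.9, 0.1, 0.5] for features A, B, C, the elimination order is B, then C, then A, so A is ranked as the most influential feature.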

It is equivalent to maximal feature elimination only by satisfying the condition given in

By adopting

In the case of a non-linear kernel, the Gaussian kernel is chosen, expressed as k(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)).

This section discusses the outcomes of the early prediction of botnets with appropriate feature analysis. Here, the simulation is carried out in the MATLAB environment. The target of this work is to detect bots with better accuracy while reducing false negatives and false positives. Some features capture behaviours such as multiple systems submitting as many requests as possible to a single Internet computer or service, overloading it and preventing it from servicing legitimate requests.

List | Features | Chosen features |
---|---|---|
F1 | TCP window size | |
F2 | Average TTL | ✓ |
F3 | DNS percent | |
F4 | Destination port quantity | ✓ |
F5 | TCP percent | |
F6 | Source privileged ports | |
F7 | Frame length | ✓ |
F8 | Local clustering coefficient | |
F9 | Percent outros | ✓ |
F10 | Out degree weight | |
F11 | Source not privileged ports | ✓ |
F12 | Out degree | |
F13 | Nodes among centrality | |
F14 | In degree weight | ✓ |
F15 | Protocol quantity | |
F16 | In degree Eigen vector centrality | ✓ |
F17 | ICMP | ✓ |
F18 | UDP | ✓ |
From the above

Methods | F1 | Precision | Accuracy | TPR | TNR | FNR | FPR |
---|---|---|---|---|---|---|---|
RF | 0.9770 | 96.49 | 97.86 | 97.78 | 96.39 | 1.02 | 3.62 |
SVM | 0.9755 | 95.34 | 95.99 | 96.87 | 95.12 | 0.13 | 4.89 |
O-SVM | 0.9935 | 97.38 | 97.86 | 92.45 | 99.56 | 0.70 | 0.65 |
ISVM | 0.9923 | 97.26 | 97.58 | 96.54 | 95.45 | 0.83 | 0.74 |
FR-SVM | 0.9861 | 96.76 | 98.78 | 98.74 | 98.54 | 1.45 | 1.34 |
OIFRSVM | 0.9892 | 95.56 | 97.68 | 97.56 | 98.65 | 0.09 | 0.07 |
SVM (Linear/Non-linear) | 0.9998 | 99.93 | 99.84 | 99.92 | 99.94 | 0.06 | 0.06 |

In

Methods | Accuracy | DR | FAR | MCC | Time to build the model (s) | Time taken (s) |
---|---|---|---|---|---|---|
REPTree | 97.98 | 0.9842 | 0.00009 | 0.9926 | 36.93 | 0.104 |
RTree | 97.19 | 0.9830 | 0.00011 | 0.9914 | 10.45 | 0.045 |
C4.5 | 97.16 | 0.9705 | 0.00008 | 0.9912 | 5.85 | 0.043 |
DNN | 97.11 | 0.9730 | 0.00011 | 0.9910 | 28.91 | 0.045 |
SMO | 97.75 | 0.9769 | 0.00029 | 0.9212 | 890.38 | 0.168 |
DT | 97.91 | 0.9626 | 0.00015 | 0.9758 | 7.03 | 0.091 |
SVM (Linear/Non-linear) | 99.98 | 0.9956 | 0.00008 | 0.9930 | 4.56 | 0.038 |

In this research, a novel approach is proposed for analyzing the linearity and non-linearity of data patterns using regression analysis to predict bots over the distributed environment. The multi-objective constraints, such as detection rate, prediction accuracy, and attack features, are resolved using the proposed SVM model. Here, higher-level and lower-level features are considered, where the higher-level features are chosen and the lower-level features are eliminated using the Redundancy Elimination-based Learning model (RedE-L). The features are maintained in the feature set (Bag of Features). The robustness of the model is analyzed based on the prediction of bot traces. The redundancy of the incoming patterns is measured against the features (in a linear and non-linear manner) to reduce over-fitting and achieve better generalization. The experimental analysis shows that this model provides higher accuracy than conventional SVM, C4.5, RepTree, and others. The detection rate of the proposed model is 0.9956, which is higher than that of other prevailing approaches. Similarly, the FAR of the model is 0.00008, which is comparatively lower than other approaches. The time taken for execution is 0.038 s, and the time for building the model is 4.56 s. The accuracy of the proposed model is 99.98%, the MCC is 0.993, and the DR is 0.9956. Based on this, the optimal features are chosen to measure bot incidence over the distributed environment efficiently. In future work, better optimization approaches need to be adopted for resolving the multi-objective constraints. The major research constraint is the lack of a recent benchmark dataset; the construction of such a dataset is highly solicited.