With the rapid growth of web-based documents, the need for automatic document clustering and text summarization has increased. Document summarization, that is, extracting the essential content, removing unnecessary data, and presenting the information in a cohesive and coherent manner, remains one of the most challenging tasks. In this research, a novel intelligent model for document clustering is designed with a graph model and fuzzy-based association rule generation (gFAR). Initially, the graph model is used to map the relationships among the (multi-source) data, followed by document clustering with association rules generated using the fuzzy concept. This method eliminates redundancy by mapping relevant documents with the graph model, and it reduces time consumption and improves accuracy through fuzzy association rule generation. The framework performs document clustering in an interpretable way. It iteratively reduces the error rate during relationship mapping among the data (clusters) with the assistance of weighted document content. The model also represents the significance of data features with class discrimination, and it helps measure the significance of features during the data clustering process. The simulation is carried out in the MATLAB 2016b environment and evaluated with empirical standards such as Relative Risk Patterns (RRP), ROUGE score, and the Discrimination Information Measure (DMI). The DailyMail and DUC 2004 datasets are used to obtain the empirical results. The proposed gFAR model gives a better trade-off compared with various prevailing approaches.

With the increasing use of the web, the strength and volume of data form an unceasing repository pool whose resources are products of the internet provided for human use [

Data clustering approaches have been deployed for efficient computation in recent years. Various transformations of the clustering model have been proposed to address single- and multi-objective constraints, where the model works prominently in complex environments [

Similarly, other concepts, such as partitioning the training data and applying a different clustering model to each partition, have been provided with efficient outcomes. The clustering outcomes from diverse baseline models are used to establish the common relationship among these strategies [

An enormous number of unstructured text documents are used by humans every day. It is not easy to process this text without an automatic approach. The automatic corpus-processing model relies on dynamic information grouping in the text-retrieval process [

This research concentrates on proposing efficient document clustering and text summarization with proper rule generation, by mapping the data and connecting its relevance through a weighted graph model. The model intends to reduce the error during data clustering, as the rules are formed in an efficient manner. The text clustering model uses a fuzzy classifier to train on the data labels produced from the preliminary clusters. Clustering is usually utilized to enhance the outcomes of document classification at an earlier stage; here, however, classification is used to enhance the outcomes of document clustering as an added contribution to the result.

The automatic data clustering and summarization process over unstructured text is considered one of the most challenging tasks [

To select the most appropriate openly accessible datasets for data clustering and summarization: here, the DailyMail and DUC 2004 datasets are used to extract the empirical results.

A novel intelligent model for document clustering is designed with a graph model and fuzzy-based association rule generation (gFAR).

The empirical analysis is carried out in the MATLAB simulation environment to evaluate metrics such as Relative Risk Patterns (RRP), ROUGE score, and the Discrimination Information Measure (DMI).

This work is arranged as follows: Section 2 includes an extensive analysis of the background studies related to data clustering; Section 3 explains the novel intelligent model for document clustering using a graph model and fuzzy-based association rule generation (gFAR); Section 4 presents the numerical results and discussion; and Section 5 concludes with future research directions.

Various extensive analyses have been carried out for unsupervised and other clustering approaches termed topic modelling; however, topic modelling provides a lower-dimensional embedding process for exploring textual datasets. Certain approaches concentrate on website search-based outcomes. Some highlighted models are recommended with user interpretation for extracting the outcomes as descriptive keywords, titles, or phrases that summarize the semantic content of topics and clusters for certain users [

The descriptive clustering model for text datasets is utilized as an information-retrieval model [

It facilitates the use of various applicable clustering models. Because of this concept, the chosen features give users superior information about the cluster content, which is more challenging. The preliminary concept behind the process is determined based on the clusters that are most alike in their cluster words, cluster centre, available titles, or phrases with similar word content [

With the above-mentioned model, it is not obvious how these objective measures are chosen based on the feature labels and lists that serve as a descriptive model. The hypothetical condition of these descriptions is essential when they provide the user with an appropriate prediction of cluster content. In some cases, predicting the feature-set characteristics determines the feature selection process [

Various techniques for text analysis use word embeddings produced with a neural network model. The word-based embedding model represents single words in a distributive manner [

Here, the novel intelligent model for document clustering is designed with a graph model and Fuzzy based association rule generation (gFAR).

The proposed model introduces a graph-based document clustering and summarization approach that facilitates multi-labelling and determines the number of clusters automatically from a set of documents. The most essential task during document clustering is to evaluate the relationships among the documents. These relationships are learned from document vectors to measure document quality. Here, the framework employs a probabilistic model over the graph to extract the data resources. The data clustering model includes the components given below.

The initial clusters are composed of documents with similar words. A cluster is expanded during document allocation when p(C_k^i | x) is greater than the cluster threshold for C_k^i. The extracted words (and their synonyms) are clustered. Clusters that produce higher similarity then need to be integrated. For cluster integration, the similarities among the cluster vectors are computed efficiently; the vectors are taken as the average document vectors of similar clusters. To evaluate similarity, the documents are represented as vectors from the document and the multi-source network. Cluster initialization takes O(n) time, as words are extracted with association rules in O(n) time; document allocation takes O(Kn) time, where K is the number of initial document clusters; and cluster integration takes O(K'^2) time, computed over the cluster pairs. The value of K' is much smaller than n, so the procedure scales to larger document collections.
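As a rough illustration of the allocation-and-merge procedure above (a minimal sketch, not the authors' MATLAB implementation; the cosine measure, running-average centroids, and both thresholds are illustrative assumptions), the O(Kn) allocation and O(K'^2) integration steps can be written as:

```python
import numpy as np

def cosine(a, b):
    # cosine similarity with a small guard against zero vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def init_clusters(doc_vectors, threshold=0.5):
    """Greedy O(Kn) allocation: join the most similar existing cluster
    whose centroid similarity exceeds the threshold, else open a new one."""
    clusters, centroids = [], []
    for i, v in enumerate(doc_vectors):
        best, best_sim = None, threshold
        for k, c in enumerate(centroids):
            sim = cosine(v, c)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is None:
            clusters.append([i])
            centroids.append(np.asarray(v, dtype=float).copy())
        else:
            clusters[best].append(i)
            n = len(clusters[best])
            centroids[best] += (v - centroids[best]) / n  # running average
    return clusters, centroids

def merge_similar(clusters, centroids, merge_threshold=0.9):
    """O(K'^2) integration: repeatedly merge cluster pairs whose
    average (centroid) vectors are nearly identical."""
    merged = True
    while merged:
        merged = False
        for a in range(len(centroids)):
            for b in range(a + 1, len(centroids)):
                if cosine(centroids[a], centroids[b]) > merge_threshold:
                    clusters[a] += clusters.pop(b)
                    centroids[a] = (centroids[a] + centroids.pop(b)) / 2
                    merged = True
                    break
            if merged:
                break
    return clusters, centroids
```

The repeated pairwise scan in `merge_similar` is what gives the O(K'^2) cost over cluster pairs described above.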

The objects with higher network links are considered for further connection establishment.

Here, N_G(d_j) denotes the neighbourhood of document d_j in the graph G. The posterior probability that document d_i comes under the graph model G_k is expressed as in

From the above, the p(d_i) values are identical. Similarly, when the significance of d_j is higher, the probability of visiting d_j exceeds that of d_k. Thus, the function establishes the relationship among the data, which is expressed as in

The probability of visiting the document (P_d) is computed by evaluating the relationship between d_j and d_k. The average association is established using

The score function is directly proportional to the posterior probability that the object belongs to the cluster, and is expressed as in

The score is determined as the number of objects belonging to the graph divided by the total number of objects. It is expressed in terms of the data's relationship to the document, as the score is directly proportional to the document relationship during the clustering process.

Here, a fuzzy-based clustering approach is utilized for clustering the documents based on the association rules generated among the clusters. The data is partitioned into various clusters with centres c_l. The cluster relationship and the association among the mapped data points are treated as fuzzy. The membership function u_{i,j} of data point x_i is chosen to give minimal distortion, where u_{i,j} is the membership that specifies the cluster, as shown in

Here, u_{i,j} is the membership of data point x_i in cluster c_j. The fuzzification parameters are provided with appropriate clusters. The mapping of associations among the vectors is improved with data-point partitioning. The process starts with initial cluster centres and continues until the stopping criteria are fulfilled. When no two documents (clusters) are similar, then 0 ≤ u_{i,j} ≤ 1, with u_{i,j} = 1 when a document coincides with its cluster and u_{i,j} = 0 for all other clusters.

Here, 0 < u_{i,j} < 1 with Σ_j u_{i,j} = 1. This is adopted to predict the topics and words of the matrix. The clusters are evaluated to attain better clusters, whose number is specified as K_p + 1. The generation of newer clusters is provided as in

The process is stopped when ||c_j(t+1) − c_j(t)|| < ε, where c_j(t) is the centre of cluster j at iteration t and ε is the stopping threshold.
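The fuzzy update loop described above follows the familiar fuzzy c-means scheme; a minimal sketch (assuming Euclidean distance, fuzzifier m = 2, and the ||c_j(t+1) − c_j(t)|| < ε stopping rule, not the authors' exact formulation) is:

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, eps=1e-4, max_iter=200, seed=0):
    """Alternate membership and centre updates until the largest centre
    shift falls below the stopping threshold eps."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # initialise centres on randomly chosen distinct data points
    C = X[rng.choice(len(X), n_clusters, replace=False)].copy()
    for _ in range(max_iter):
        # distances from every point to every centre (guarded against 0)
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2) + 1e-12
        # u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)); each row sums to 1
        ratio = (D[:, :, None] / D[:, None, :]) ** (2.0 / (m - 1))
        U = 1.0 / ratio.sum(axis=2)
        Um = U ** m
        C_new = (Um.T @ X) / Um.sum(axis=0)[:, None]  # weighted centres
        shift = np.max(np.linalg.norm(C_new - C, axis=1))
        C = C_new
        if shift < eps:  # stopping criterion ||c_j(t+1) - c_j(t)|| < eps
            break
    return U, C
```

Taking `argmax` over each membership row yields a hard cluster assignment when one is needed.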

The weighted edges of the documents are measured by the association among the data points. The ROUGE scores are attained and maintained over the outcomes; the process can be merged and the scores evaluated. The probability of word occurrence over the input clusters is obtained from background knowledge to predict the most appropriate words. For each input cluster, the background corpus is the same except for the chosen cluster: when summarizing the first cluster of the DUC 2002 dataset, the remaining 29 clusters serve as the background corpus. Three different ROUGE scores are measured to demonstrate the significance of the model.
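The n-gram overlap behind the ROUGE variants used here can be illustrated with a small recall-oriented sketch (a simplified stand-in for the ROUGE toolkit, not the official scorer):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Recall-oriented ROUGE-N: clipped n-gram overlap between a candidate
    summary and a reference summary, divided by the reference n-gram count."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(ref.values()), 1)
```

ROUGE-L differs in that the overlap is based on the longest common subsequence rather than fixed-length n-grams.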

Here, a network-traffic dataset termed the NSL-KDD dataset is introduced in association with the anticipated model. To analyze the functionality of this method, diverse experimental comparisons are performed. This dataset includes both testing and training sets. The chosen features describe the dataset with preliminary statistical and content information about each network connection; the feature size is 41. The dataset labels include five diverse network events: probe, normal, denial of service (DoS), user to root, and remote to local (R2L). Various investigators consider the NSL-KDD dataset an authoritative benchmark in intrusion detection, so it is used in this work to evaluate the semi-supervised approach. It comprises various attack patterns that are appropriate for validating generalization capability. Random samples are chosen and the remaining samples are utilized as unlabelled data. Intrusion detection is considered a multi-class problem. The experimentation is performed on a PC with an Intel i5 processor @ 3.00 GHz, Windows 7 OS, and 8 GB RAM.

There are two types of features: numerical and symbolic. The anticipated model must deal with symbolic features whose values are not distributed randomly, which can have a negative effect on the learning process. To address this, data normalization and one-hot encoding are applied before learning. The feature values are sequence-encoded with 0 and 1, and the dimensionality changes with the number of distinct values of each symbolic feature. Features like 'protocol type', 'service', and 'flag' are one-hot encoded when they take more than two values; two-valued symbolic features are treated as Boolean with 1 or 0.
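The encoding step can be sketched as follows (a generic one-hot routine; the feature names mirror NSL-KDD, but the helper itself is an illustrative assumption, not the paper's preprocessing code):

```python
def one_hot_encode(records, symbolic_keys):
    """Expand each symbolic feature into 0/1 indicator columns; the output
    dimensionality grows with the number of distinct values per feature."""
    # collect the sorted distinct values of each symbolic feature
    values = {k: sorted({r[k] for r in records}) for k in symbolic_keys}
    encoded = []
    for r in records:
        row = []
        for k in symbolic_keys:
            # one indicator column per distinct value of this feature
            row.extend(1 if r[k] == v else 0 for v in values[k])
        encoded.append(row)
    return encoded, values
```

For example, a 'protocol type' feature taking the values tcp/udp/icmp would expand from one symbolic column into three Boolean columns.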


The ROUGE toolkit is utilized to compute the performance of the proposed gFAR model, including ROUGE-1, ROUGE-2, and ROUGE-L.

Datasets | Documents (total) | Classes (total) | Instance size (max) | Instance size (average) | Instance size (min) | Data length (average)
---|---|---|---|---|---|---
DailyMail | 8094 | 4 | 4203 | 2774 | 2033 | 39
Re0 | 1504 | 13 | 608 | 116 | 11 | 69
DUC 2004 | 9649 | 165 | 4725 | 131 | 1 | 42
WebKB | 4199 | 4 | 1641 | 1050 | 504 | 124

ROUGE scores on DUC 2002:

Model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
URank | 0.490 | 0.220 | -
Tgraph | 0.485 | 0.230 | -
Lead-3 | 0.440 | 0.220 | 0.405
Cheng'16 | 0.475 | 0.240 | 0.440
SummaRunner | 0.475 | 0.230 | 0.425
HSSAS | 0.530 | 0.230 | 0.490
gFAR | 0.621 | 0.255 | 0.500

ROUGE scores on DailyMail:

Model | ROUGE-1 | ROUGE-2 | ROUGE-L
---|---|---|---
URank | 0.393 | 0.158 | 0.360
Tgraph | 0.360 | 0.134 | 0.330
Lead-3 | 0.400 | 0.164 | 0.355
Cheng'16 | 0.396 | 0.174 | 0.370
SummaRunner | 0.417 | 0.158 | 0.395
HSSAS | 0.424 | 0.179 | 0.380
gFAR | 0.430 | 0.185 | 0.398

With news articles, the essential information tends to appear at the beginning of the article. The proposed model provides better ROUGE metrics, beating the performance of URank, Tgraph, Lead-3, Cheng'16, SummaRunner, and HSSAS. When dealing with the abstractive models, the ROUGE measures overlap with one another with minor variations that matter little for readability. These discrete ROUGE metrics do not by themselves capture increases in the readability and quality of the produced summary; they are provided to justify the scores over the abstractive baselines utilized in this work. Moreover, the problem with ROUGE metrics grows with the number of reference summaries for a given document: the inflexibility of a ROUGE score computed against a single reference summary per document makes it much lower than one computed against multiple reference summaries. Finally, the proposed gFAR model attains better outcomes than the prevailing models.

It is shown that gFAR attains better-quality outcomes. The outcomes of RR and MDI are compared between two models, i.e., clustering-based discrimination information maximization (CDIM) and gFAR. The significant decision is attained with the choice established among the discrimination values, namely the measurement of discrimination information (MDI) and relative risk (RR). The performance of CDIM is poor compared to the gFAR model: the RR/CDIM and MDI/CDIM values fall short of the corresponding RR/gFAR and MDI/gFAR values. Some simple patterns can be sensed from these outcomes. The results of RR are stronger with small '

Name | RR/CDIM | MDI/CDIM | RR/gFAR | MDI/gFAR |
---|---|---|---|---|

Pu | 0.763 | 0.608 | 0.820 | 0.620 |

Movie | 0.605 | 0.575 | 0.650 | 0.580 |

Citeseer | 0.440 | 0.405 | 0.450 | 0.410 |

Hitech | 0.430 | 0.485 | 0.445 | 0.490 |

Tr31 | 0.630 | 0.598 | 0.675 | 0.610 |

Cora | 0.365 | 0.340 | 0.385 | 0.350 |

Re0 | 0.412 | 0.440 | 0.425 | 0.450 |

wap | 0.475 | 0.512 | 0.480 | 0.530 |

MDI is utilized to evaluate the term discrimination information for measuring the semantic relationship among the terms. Its measures, I_{1ε} and I_{2ε}, quantify the discrimination between the distribution of the combined data/category 1 and the combined data/category 2. During the data clustering process, categories 1 and 2 are associated with the given data clusters C_k and C_j.

Here, π_1 and π_2 are the prior probabilities of clusters C_k and C_j.

The relative risk of the data cluster C_k over the other clusters C_j is computed for all cluster pairs C_k and C_j.

Here, p(t_j | C_k) is the conditional probability of term t_j given the cluster C_k. The range of the discrimination information lies in (0, ∞).
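Under the definitions above, RR compares a term's conditional probability inside a cluster with its probability outside, and MDI is a prior-weighted symmetric KL-style divergence between the two category distributions. A minimal numeric sketch (illustrative, not the paper's exact estimator) is:

```python
import math

def relative_risk(p_in, p_out):
    """Relative risk of a term for a cluster: p(t_j | C_k) / p(t_j | not C_k),
    guarded against a zero denominator."""
    return p_in / max(p_out, 1e-12)

def mdi(p, q, prior_p=0.5, prior_q=0.5):
    """Prior-weighted symmetric KL divergence between the term
    distributions of category 1 (p) and category 2 (q)."""
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability terms in a
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return prior_p * kl(p, q) + prior_q * kl(q, p)
```

Identical distributions give an MDI of zero, and the measure grows without bound as the two category distributions diverge, matching the (0, ∞) range noted above.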

Model | Informative | Non-redundancy | Overall (%) |
---|---|---|---|

URank | 23% | 21% | 20% |

Tgraph | 13% | 19% | 16% |

Lead-3 | 15% | 16% | 21% |

Cheng’16 | 19% | 22% | 18% |

SummaRunner | 17% | 22% | 25% |

HSSAS | 20% | 25% | 27% |

gFAR | 28% | 27% | 28% |

Model | DUC2002 ROUGE-1 | DUC2002 ROUGE-2 | DailyMail ROUGE-1 | DailyMail ROUGE-2
---|---|---|---|---
gFAR + weight | 53.7 | 25.7 | 42.1 | 19.5
gFAR − weight | 44.3 | 21.1 | 39.7 | 15.3

This work has presented a novel intelligent model for document clustering designed with a graph model and fuzzy-based association rule generation (gFAR). The unique characteristics of the datasets (DailyMail and DUC 2002) are maintained. The framework performs document clustering in an interpretable way, iteratively reducing the error rate during relationship mapping among the data (clusters) with the assistance of weighted document content. The simulation is done in the MATLAB 2016b environment, and empirical standards such as Relative Risk Patterns (RRP), ROUGE score, and the Discrimination Information Measure (DMI) are measured. The gFAR + weight configuration attains 53.7 and 25.7 (DUC 2002 ROUGE-1 and ROUGE-2) and 42.1 and 19.5 (DailyMail ROUGE-1 and ROUGE-2). The human evaluation of document clustering gives 28% informativeness, 27% non-redundancy, and 28% overall performance. The ROUGE (1/2/L) scores of gFAR on DUC 2002 are 0.621, 0.255, and 0.500; on DailyMail they are 0.430, 0.185, and 0.398. The proposed gFAR shows a better trade-off in contrast to prevailing approaches.

In the future, this research will be extended by considering how embedded words can be merged with multi-source textual data to enhance the multi-source model and attain better multi-document text clustering to a certain semantic extent.