The world of information technology is being flooded with ever-larger amounts of data, nearly 2.5 quintillion bytes every day. This large stream of data is called big data, and the amount is increasing each day. This research uses a technique called sampling, which selects a representative subset of the data points, manipulates and analyzes this subset to identify patterns and trends in the larger dataset, and finally creates models. Because sampling uses only a small proportion of the original data for analysis and model training, it is relatively fast while maintaining data integrity and achieving accurate results. Two deep neural networks, AlexNet and DenseNet, were used in this research to test two sampling techniques, namely sampling with replacement and reservoir sampling. The dataset used for this research was divided into three classes: acceptable, flagged as easy, and flagged as hard. The base models were trained with the whole dataset, whereas the other models were trained on 50% of the original dataset. There were four combinations of model and sampling technique. The F-measure for the base AlexNet model was 0.807, while that for the base DenseNet model was 0.808. Combination 1, the AlexNet model with sampling with replacement, achieved an average F-measure of 0.8852. Combination 3, the AlexNet model with reservoir sampling, had an average F-measure of 0.8545. Combination 2, the DenseNet model with sampling with replacement, achieved an average F-measure of 0.8017. Finally, combination 4, the DenseNet model with reservoir sampling, had an average F-measure of 0.8111. Overall, we conclude that the models trained on a sampled dataset gave results comparable to or better than the base models trained on the whole dataset.

Big data analytics is rapidly becoming popular in the business sector. Firms have realized that their big data is an unexploited resource that can help them to reduce costs, increase income, and gain a competitive edge over their rivals. Organizations no longer want to archive these vast amounts of data; instead, they are increasingly converting that information into important insights that may be useful in improving their operations. Big data is digital information that is stored in high volumes, processed at speed, and diverse [

Applying the results from big data analytics is not always as easy as may be expected [

Because problems associated with handling and evaluating large amounts of sophisticated data are common in data analytics, different data analysis approaches have been proposed for reducing the time needed for computation and for reducing the memory needed for the knowledge discovery in databases process [

Despite the advantages of probability sampling, samples are often collected based on the researcher’s judgment, but such samples may not fully represent the population of interest [

Two important sampling methods discussed by Kim et al. [

A real big data system must analyze large volumes of data in a narrow timeframe. Therefore, the majority of studies sample the data to enhance their effectiveness. “Distribution goodness-of-fit plays a fundamental role in signal processing and information theory, which focuses on the error magnitude between the distribution of a set of sample values and the real distribution” [

The key theory in deep learning algorithms is abstraction [

According to Najafabadi et al. [

Deep learning algorithms produce abstract representations, since more abstract representations are frequently built from less abstract ones. A significant advantage of more abstract representations is that they remain unchanged under local transformations of the input data. Learning such invariant structures is a continuing objective in pattern recognition. Besides remaining intact, such representations can also disentangle the factors of variation in datasets. The real data employed in AI are usually based on complex connections from numerous sources. For instance, an image can be made up of different sources of variation such as light, object outlines, and the materials the object is made of [

The experiments conducted give a deep dive into data analytics with regard to content-based prediction while sampling the data. The key contributions of the paper are:

We explore the capabilities of two deep learning models: AlexNet and DenseNet.

We assess how effective the sampled data performs compared to using all the data.

We explore two different sampling techniques: sampling with replacement and reservoir sampling.

Kim et al. [

Mahmud et al. [

Bierkens et al. [

Rojas et al. [

Johnson et al. [

These works are summarized in

Researchers | Year | Methodology | Result |
---|---|---|---|
Kim et al. [ | 2019 | Inverse sampling and data integration | Better coverage rates than the alternatives |
Mahmud et al. [ | 2020 | Sampling from a Hadoop cluster | – |
Bierkens et al. [ | 2019 | Zig-zag process | Generates insights that help to focus on different characteristics of the data, without compromising quality in data exploration |
Rojas et al. [ | 2017 | Random sampling | Generates insights based on different characteristics of the data, without compromising quality in data exploration |
Johnson et al. [ | 2020 | Hybrid ROS-RUS (random over-sampling and random under-sampling) | Maximizes performance and efficiency |

Many model methods and techniques have been devised for deep content-based prediction. Most perform relatively well, but some limitations still hinder the accuracy of the models, so they are not optimal. Most of these limitations apply to almost all models, so we discuss them in general terms. The first limitation is the integrity of the dataset used. Data quality plays a very important role in the effectiveness of a model; if the right dataset is not used, training will be hindered. The second limitation concerns sampling: if the sampled data used for training contain too many outliers, the model will not perform well. Hence, data cleaning and transformation are very important. Finally, the choice of model and the tuning of its parameters and hyperparameters also affect the model's effectiveness; if done wrong, the model will not be successful.

There are four categories of big data analytics. The first category is descriptive analytics, which produces simple descriptions and visualizations that indicate what has transpired in a given period [

Big data analytics is used for several reasons. First, it can enable business transformation. Generally, managers feel that big data analytics provide significant opportunities for transformation [

Big data analytics can help with national security, as they provide state-of-the-art techniques in the fight against terrorism and crime. There is a wide range of case studies and application scenarios [

To study the influence of sampling techniques on training models, two models were chosen. Since the data consisted of images, models based on a convolutional neural network [

AlexNet was the first convolutional network to use GPUs to boost performance. It was developed by Alex Krizhevsky [

AlexNet won the ImageNet Large Scale Visual Recognition Challenge. The winning model was tuned as follows: ReLU was used as the activation function; normalization layers were used, which are no longer common; the batch size was 128; the learning algorithm was SGD with momentum; a large amount of data augmentation was applied, such as flipping, jittering, cropping, and color normalization; and the models were used in ensembles to give the best results.

DenseNet is a type of convolutional neural network that utilizes dense connections between layers through dense blocks (

This paper uses a dataset composed of key images extracted from the NDPI videos [

The experiments done for this study aim to analyze the influence of sampling techniques on the outcome of models. Sampling techniques are important as they could allow deep learning models to achieve comparable results with less data and spend less time on training the model. The two sampling techniques used in this paper are sampling with replacement and reservoir sampling. We reduced the amount of data using random sampling by removing 50% of the instances. The performance and predictability of the trained model were then compared. The sampling was done as follows:

In sampling with replacement, an instance is chosen randomly from the dataset. This continues until the number of instances selected is half the number in the complete dataset. However, every time an instance is chosen, it is still available for selection. Thus, the same instance can be sampled multiple times.
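Sampling with replacement can be sketched in a few lines of Python; the integer dataset below is a placeholder for the actual image instances used in the experiments:

```python
import random

def sample_with_replacement(dataset, fraction=0.5, seed=None):
    """Draw fraction * len(dataset) instances uniformly at random,
    leaving each chosen instance available for re-selection."""
    rng = random.Random(seed)
    k = int(len(dataset) * fraction)
    # Each draw is independent, so the same instance can appear more than once.
    return [rng.choice(dataset) for _ in range(k)]

sample = sample_with_replacement(list(range(1000)), seed=42)
print(len(sample))  # 500
```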

In reservoir sampling, the samples are chosen without replacement. Thus, each instance can occur at most once in the sampled data. If the complete dataset has n instances, the first n/2 instances initially fill the reservoir; each subsequent instance then replaces a randomly chosen reservoir element with probability (n/2)/i, where i is its position in the stream, so that every instance has an equal chance of appearing in the sample.

This process of selecting 50% of the data randomly was repeated 10 times. Each sample was then used to develop a deep learning model, using 90% of the data for training and 10% for testing the generated model.
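As a sketch, reservoir sampling (Algorithm R) and the 90%/10% split can be implemented as follows; the integer dataset is again a placeholder for the image instances:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: a uniform random sample of k items, without
    replacement, from a stream whose length need not be known."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

data = list(range(1000))
sample = reservoir_sample(data, k=len(data) // 2, seed=0)
train, test = sample[:450], sample[450:]  # 90% for training, 10% for testing
print(len(sample), len(set(sample)))  # 500 500 -- no duplicates
```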

The models were evaluated using the F-measure, which is mainly used in state-of-the-art applications for similar problems and is well suited to this type of task. The F-measure takes into account precision and recall. The F-measure, also known as the F1 score, is calculated as follows:

F1 = 2 × (precision × recall) / (precision + recall)
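As a minimal sketch, the F1 score can be computed from true-positive, false-positive, and false-negative counts (the counts below are illustrative only):

```python
def f_measure(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(round(f_measure(tp=80, fp=20, fn=20), 3))  # 0.8
```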

Each model was trained and tested 10 times with each sampling technique. Thus, for each combination of deep learning model and sampling technique, there were 10 experiments. The four combinations are shown in

Deep learning model | Sampling with replacement | Reservoir sampling |
---|---|---|
AlexNet | Combination 1 | Combination 3 |
DenseNet | Combination 2 | Combination 4 |

As we can see in

In

As we can see from the results in

Experiment | AlexNet (sampling with replacement) | DenseNet (sampling with replacement) | AlexNet (reservoir sampling) | DenseNet (reservoir sampling) |
---|---|---|---|---|
Base model | 0.807 | 0.808 | 0.807 | 0.808 |
1 | 0.883 | 0.805 | 0.839 | 0.802 |
2 | 0.894 | 0.782 | 0.859 | 0.814 |
3 | 0.878 | 0.780 | 0.853 | 0.823 |
4 | 0.884 | 0.800 | 0.869 | 0.811 |
5 | 0.889 | 0.805 | 0.852 | 0.799 |
6 | 0.898 | 0.805 | 0.866 | 0.808 |
7 | 0.884 | 0.801 | 0.852 | 0.822 |
8 | 0.878 | 0.812 | 0.863 | 0.807 |
9 | 0.884 | 0.816 | 0.842 | 0.801 |
10 | 0.880 | 0.811 | 0.850 | 0.824 |
Average | 0.8852 | 0.8017 | 0.8545 | 0.8111 |
Std. Dev. | 0.007 | 0.012 | 0.010 | 0.009 |

The first experiment conducted was the base case, which used the whole dataset for training and validation; it served as the baseline for comparing the sampling techniques. AlexNet had an F-measure of 0.807 in the base test. With sampling with replacement, the F-measure ranged from 0.878 to 0.898, with an average of 0.885. These results are excellent compared to the base experiment. With reservoir sampling, the lowest F-measure was 0.839 and the highest was 0.869, with an average of 0.855. These results are also good compared with those for the base experiment. In contrast, the base DenseNet experiment had a slightly higher F-measure than the base AlexNet, with a score of 0.808. When sampling with replacement using DenseNet, the lowest score was 0.780 and the highest was 0.816, with an average of 0.802. With reservoir sampling, the lowest score was 0.799 and the highest was 0.824, with an average of 0.811. On average, reservoir sampling with DenseNet slightly exceeded the base experiment, while sampling with replacement fell marginally below it; neither technique with DenseNet was as effective as with the AlexNet model.

From the results, it can be seen that the sampled AlexNet model was more accurate than its base model by 7.82 and 4.75 percentage points for sampling with replacement and reservoir sampling, respectively. The DenseNet model did not improve as much: its accuracy was very close to that of its base model, with a difference of less than one percentage point for both sampling techniques. For both models, this research shows that sampling is very useful for saving the time and cost of analysis and model training. Even though not all the data are used, sampling achieves the same, if not better, accuracy than using the original dataset.
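The reported averages and percentage-point differences can be reproduced directly from the per-run F-measures in the results table; for example, for the AlexNet runs:

```python
# Per-run F-measures for AlexNet, taken from the results table.
alexnet_swr = [0.883, 0.894, 0.878, 0.884, 0.889, 0.898, 0.884, 0.878, 0.884, 0.880]
alexnet_res = [0.839, 0.859, 0.853, 0.869, 0.852, 0.866, 0.852, 0.863, 0.842, 0.850]
base_alexnet = 0.807

def mean(xs):
    return sum(xs) / len(xs)

print(round(mean(alexnet_swr), 4))  # 0.8852
print(round(mean(alexnet_res), 4))  # 0.8545
# Differences from the base model, in percentage points.
print(round((mean(alexnet_swr) - base_alexnet) * 100, 2))  # 7.82
print(round((mean(alexnet_res) - base_alexnet) * 100, 2))  # 4.75
```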

None of the recent works reported in the literature review used either sampling with replacement or reservoir sampling, both of which gave better results. Some used random sampling, which is less effective compared to other sampling techniques.

There is an endless stream of big data, about 2.5 quintillion bytes every day, from a vast pool of sources such as social media, the internet of things, the cloud, and other databases, just to mention a few. Even though a large amount of data is required for analysis and for deep neural networks to be effectively trained, having too much data is a problem. Although effective results can be obtained with a dataset containing 10,000 items, 10 million is just too many. The time it would take to train a model and analyze this cumbersome amount of data is too long. This research aimed to demonstrate the effectiveness of using only a portion of the data.

The dataset in this research was randomly sampled so that only 50% was used. As there were two deep learning models (AlexNet and DenseNet) and two sampling techniques (sampling with replacement and reservoir sampling), there were four test combinations. After training, the results were compared to the base models trained on the whole dataset. It was found that the sampled datasets produced the same or higher accuracy during testing. The high accuracy obtained from the sampled datasets indicates that when analyzing data and training deep neural models, it is better to sample the data if there is too much, since doing so is relatively less time-consuming and achieves effective results.

The researchers would like to thank the Deanship of Scientific Research, Qassim University for funding the publication of this project.