Data mining and analytics involve inspecting and modeling large pre-existing datasets to discover decision-making information. Precision agriculture uses data mining to advance agricultural development. Many farmers do not get the most out of their land because they do not use precision agriculture; they harvest crops without a well-planned recommendation system. Future crop production is estimated by combining environmental conditions and management behavior, yielding both numerical and categorical data. Most existing research fails to address data preprocessing and crop categorization/classification adequately, and statistical analysis receives less attention despite producing more accurate and valid results. The study was conducted on a dataset for Karnataka state, India, with eight parameters taken into account per crop: the minimum amounts of fertilizer required (nitrogen, phosphorus, and potassium), pH, rainfall, season, soil type, and temperature, providing precise cultivation recommendations for high productivity. The presented algorithm first converts discrete numerals to factors and then reduces levels. Second, the algorithm generates six datasets, two each from Case-1 (a dataset with many numeric variables), Case-2 (a dataset with many categorical variables), and Case-3 (a dataset with reduced factor variables). Finally, the algorithm outputs a class membership allocation based on an extended version of the K-means partitioning method with lambda estimation. The presented work produces mixed-type datasets with precisely categorized crops by organizing data based on environmental conditions, soil nutrients, and geo-location. The prepared dataset then solves the classification problem, leading to a model evaluation that selects the best dataset for precise crop prediction.

Blockchain technology could revolutionize agriculture by addressing product fraud, traceability, price gouging, and consumer mistrust. The author [

Agriculture has a long tradition in India, which was recently ranked second in global agricultural output. Agriculture and related sectors such as forestry and fisheries generated roughly half of all jobs and 16.6% of GDP in 2009, although the agricultural sector's contribution to India's GDP is declining. Crop production is agriculture's primary source of revenue [

The majority of farmers in underdeveloped countries continue to employ centuries-old farming techniques. These methods do not ensure a high yield per acre. One of the numerous issues with conventional agriculture is that farmers choose crops based on market demand rather than the productivity of their land. Crop recommendation is a strategy that assists farmers in determining which crops will produce the most yield per hectare. A crop recommendation system, also known as a prediction system, is the art of anticipating crop yields to optimize productivity prior to harvesting; it is typically done several months in advance. Because crop recommendation systems entail processing vast amounts of soil, fertilizer, geographical, and meteorological data, machine learning (ML) approaches are utilized to handle this overwhelming data efficiently. ML-based systems can take many inputs and perform a range of non-linear tasks. They are comprehensive and cost-effective solutions for better crop advice and decision-making in general.

Programs that quantitatively explain plant-environment and soil feature interactions are used to provide crop recommendations. The technique starts with gathering a field soil sample for scientific soil testing. A field can be sampled so that the chemical composition of the soil sample, which is also influenced by temperature and rainfall, precisely shows the actual nutrient status of the field in a particular location, benefiting farmers and increasing production. Soil testing is the first fundamental premise of precision agriculture's crop recommendation procedure. By putting all this effort into the values and procedures, researchers can construct an intuitive crop recommendation system that delivers suggestions with a small margin of error depending on agricultural seasons and other parameters. This crop suggestion approach helps farmers make better-educated selections, resulting in more efficient and lucrative farming techniques.

From the literature, it is observed that most existing work fails to explain how and on what basis the crops were classified. Such work suggests crops based on soil properties or climatic conditions alone. If we recommend using only one of the two scenarios, the accuracy of the prediction decreases. We considered both scenarios in the proposed work to overcome the challenges above. This allows us to recommend the best dataset to researchers, increasing the accuracy of crop recommendation.

Precision agriculture is essential in developing countries like India, where traditional or even ancient farming practices predominate. Precision agriculture, also known as site-specific agriculture, assists farmers in taking care of their land by increasing yield per unit of land and reducing pesticide and fertilizer waste. Statistical approaches are used to classify yields by soil potential. With management zones, farmers can harvest the right crops at the right sub-yield levels, allowing them to use less fertilizer, insecticide, and other inputs. Traditional yield prediction is based on a farmer's previous crop harvests at a specific time; precision agriculture promotes yield prediction based on data. We use data mining, modeling, and statistical models to forecast crop harvests, and data-based yield estimates are getting closer to the actual crop yield. When selecting crops, many farmers overlook soil potential. The demand for "expert systems" is growing in tandem with the rise of precision agriculture, which, like other businesses, will increasingly rely on data. Spatial data mining on such datasets will become much more critical in the future, and the associated problems should be addressed using intelligent informatics and geostatistics methods. Precision agriculture's crop suggestion system can help farmers make better decisions: this technique chooses the best crops for a plot of land based on data and analytical models. This inspired us to conduct precision agriculture research.

The significant contributions of the paper are listed below.

The major challenge in the proposed work is the data: the data received from various sources is not in the proper format. The incorrectly formatted dataset is transformed into the correct form by creating a data frame from all combinations of the supplied feature vectors.
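This reshaping step resembles R's expand.grid. As an illustration only, the following sketch builds records from all combinations of a few assumed feature vectors; the value lists here are hypothetical, not the paper's actual data:

```python
from itertools import product

# Hypothetical feature vectors; the real ones come from the soil and
# climate sources described in the paper.
n_values = [70, 80, 90]        # nitrogen levels
p_values = [30, 40, 50]        # phosphorus levels
seasons = ["Kharif", "Rabi"]

# Cross all supplied feature vectors (as R's expand.grid does) so that
# every combination becomes one well-formed record of the data frame.
records = [
    {"N": n, "P": p, "Season": s}
    for n, p, s in product(n_values, p_values, seasons)
]
```

With three nitrogen levels, three phosphorus levels, and two seasons, this yields 3 × 3 × 2 = 18 records.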

Another finding is that numerical parameters such as N, P, K, and temperature have a discrete value range. It has been observed and tested that the most popular tree-based classification algorithms perform better with datasets that contain more categorical variables than numeric ones.
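The conversion of such discrete numeric readings into factor levels can be sketched as simple range binning. The bin boundaries, the `to_factor` name, and the label format (mirroring the n_70–90 style used later in the paper) are assumptions for illustration:

```python
def to_factor(value, bins, prefix="n"):
    """Map a discrete numeric reading to a categorical level label such as
    'n_70-90', following the label style of the transformed dataset."""
    for low, high in bins:
        if low <= value <= high:
            return f"{prefix}_{low}-{high}"
    return f"{prefix}_other"

nitrogen_bins = [(70, 90), (100, 120)]  # assumed example ranges
print(to_factor(80, nitrogen_bins))     # n_70-90
```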

Recommendation of the crop dataset using cluster-based techniques.

The organization of the rest of the paper is as follows.

Section 2 discusses the background work of researchers in agriculture and yield prediction. Section 3 presents the proposed model for yield prediction and recommends which crop to cultivate. The model also suggests the most suitable time for the use of fertilizers. Section 4 discusses the results, and Section 5 concludes the paper.

The authors of [

The work published in [

The authors of [

The dataset contains 135 different crops in the target column that were grown in the corresponding location in India, according to the authors of [

The authors of [

Using data analytics, researchers in [

The authors of [

The author [

Recently the author [

This study improved a genetic algorithm (IGA) for recommending crop nutrition levels. The algorithm optimizes by exploring and exploiting the neighborhood. The model improves the local optimization strategy within the population to avoid premature convergence to local optima, while diversity preservation retains population knowledge. On real-world datasets, the novel IGA method may outperform conventional recommendations. As a result, the program optimizes production and nutrient levels [

The end-to-end multi-objective neural evolutionary algorithm (MONEADD) for combinatorial optimization is introduced in this study. It is governed by dominance and decomposition. MONEADD is an end-to-end approach that uses genetic operations and reward signals to evolve neural networks for combinatorial optimization tasks. In each generation, non-dominated neural networks are retained based on dominance and decomposition to accelerate convergence. Traditional heuristic approaches start from scratch for each test problem, whereas the trained model can solve equivalent problems during inference. Three multi-objective search strategies improve model inference performance [

The research was carried out in the state of Karnataka, India. The analysis examines eight variables for various crops. The minimum amounts of fertilizer required are nitrogen (N), phosphorus (P), and potassium (K). Another parameter used in the study is pH. Soil pH is a measure of the soil’s acidity or alkalinity. The other four parameters for increased crop productivity are temperature, rainfall conditions, soil type, and season, as shown in

Crops | N | Crops | P | Crops | pH | Crops | Season |
---|---|---|---|---|---|---|---|
Paddy | 70 | Paddy | 30 | Paddy | 5.5 | Paddy | Kharif |
Paddy | 80 | Paddy | 40 | Paddy | 6.5 | Wheat | Rabi |
Paddy | 90 | Paddy | 50 | Wheat | 6 | Jowar | Kharif |
Wheat | 90 | Wheat | 30 | Wheat | 7 | Barley | Rabi |
Wheat | 100 | Wheat | 40 | Jowar | 6 | Bajra | Kharif |
Wheat | 110 | Wheat | 50 | Jowar | 8.5 | Ragi | Kharif |

Crops | N | P | K | pH | Temp. | Rainfall | Season | Soil_type |
---|---|---|---|---|---|---|---|---|
Rice | 70 | 30 | 30 | 5.5 | 20 | 175 | Kharif | Clay |
Rice | 80 | 30 | 30 | 5.5 | 20 | 175 | Kharif | Clay |
Rice | 90 | 30 | 30 | 5.5 | 20 | 175 | Kharif | Clay |
Rice | 70 | 50 | 30 | 5.5 | 20 | 175 | Kharif | Clay |

Crops | N | P | K | pH | Temp. |
---|---|---|---|---|---|
Rice | n_70–90 | p_30–50 | k_30–50 | 5.5 | t_20–25 |
Rice | n_70–90 | p_30–50 | k_30–50 | 5.5 | t_20–25 |
Rice | n_70–90 | p_30–50 | k_30–50 | 5.5 | t_20–30 |
Rice | n_70–90 | p_30–50 | k_30–50 | 5.5 | t_20–30 |

Case 4.1.1 illustrates a dataset with many numeric variables, on which the heuristic method is used to build a new dataset with both the numeric and factor methods set to "1," as illustrated in

Case 4.1.1 represents a dataset with many numeric variables, while Case 4.1.2 and Case 4.1.3 are characterized by datasets with multiple factor variables. Case 4.1.2 represents multi-level datasets and separates them into two heuristic iterations, one with the numeric and factor methods both set to "1" and the other with the numeric and factor methods set to "1" and "2," respectively. Case 4.1.3, on the other hand, works with a reduced-level dataset. It also makes use of a heuristic method with two integer value possibilities. We finally have six datasets, two for each case.

1  ds := loadDataset();
2  Apply standard preprocessing on the dataset;
3  if !target_membership then
4      if ds is of mixed type then
5          if numeric discrete-valued attributes then
6              transform to factors;
7              ds := generateDataset();
8          end
9          if factor attributes have many levels then
10             group them accordingly based on domain knowledge;
11             ds := generateDataset();
12         end
13         for each outcome D of datasets do
14             // over the range to estimate the best k
15             kbest := clusterValidation();
16             // investigate the variables' variance and concentration
17             lmd := lambdaEst();
18             // run the kproto function with kbest and lambda
19             kpres := partition(ds, kbest, lmd);
20             ds := generateDataset();
21         end
22     end
23 end
24 Update the cluster numbers at the end to the new datasets as target classes.

The algorithm shows how to create mixed-type datasets based on soil properties, season, rainfall, and temperature. The algorithm receives data partition (parameter D) as input, which represents the entire set of training tuples with excluded class labels (shown in line no. 3). The algorithm generates a class membership model in which objects are assigned to the class based on a lambda estimate [

The procedure begins with standard data preprocessing, such as variable normalization, discrete numeric-to-factor conversion, and level reduction. The algorithm then computes k-prototypes clustering for the various datasets, as shown in lines 13 to 20. K-prototypes is a modified version of the k-means algorithm for clustering large datasets with categorical values; it iteratively recomputes cluster prototypes and reassigns objects to clusters.
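The reassign-and-recompute loop can be sketched as follows. This is a minimal, self-contained illustration of the k-prototypes idea (squared Euclidean distance on numeric attributes plus a lambda-weighted simple-matching distance on categorical ones), not the clustMixType implementation the work relies on; the `kprototypes` name and the example rows are hypothetical:

```python
import random

def kprototypes(rows, num_idx, cat_idx, k, lam, iters=20, seed=0):
    """Minimal k-prototypes sketch: squared Euclidean distance on numeric
    attributes plus lam * simple-matching distance on categorical ones.
    Each iteration recomputes prototypes (means for numerics, modes for
    categoricals) and reassigns objects to the nearest prototype."""
    rng = random.Random(seed)
    protos = [list(r) for r in rng.sample(rows, k)]

    def dist(r, p):
        d_num = sum((r[i] - p[i]) ** 2 for i in num_idx)
        d_cat = sum(r[i] != p[i] for i in cat_idx)
        return d_num + lam * d_cat

    assign = [0] * len(rows)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist(r, protos[c])) for r in rows]
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if not members:
                continue  # keep the old prototype for an empty cluster
            for i in num_idx:
                protos[c][i] = sum(m[i] for m in members) / len(members)
            for i in cat_idx:
                values = [m[i] for m in members]
                protos[c][i] = max(set(values), key=values.count)
    return assign, protos

# Hypothetical mixed-type rows: (nitrogen, season)
rows = [(70.0, "Kharif"), (75.0, "Kharif"), (200.0, "Rabi"), (210.0, "Rabi")]
assign, protos = kprototypes(rows, num_idx=[0], cat_idx=[1], k=2, lam=1.0)
```

On this toy data the two Kharif rows and the two Rabi rows end up in different clusters, since they are well separated in both the numeric and categorical parts.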

Equation

Heuristic methods are used to compute cluster prototypes: cluster means for numeric variables, with the spread measured by variance (num_method := 1) or standard deviation (num_method := 2), and modes for factor variables.

The algorithm calls clusterValidation() for each dataset retrieved for partitioning from lines 7 and 11. The preferred validation index is calculated using the following function:

We use the McClain and silhouette indices as our cluster-validation criteria. The silhouette method analyzes and confirms consistency within data clusters and gives a clear graphical representation of how well each object has been classified. The silhouette value contrasts an object's separation from other clusters with its cohesion within its own cluster. Instead of using the average silhouette to evaluate a clustering produced by k-medoids or k-means, we can optimize the silhouette directly; these methods assign each point to its closest cluster, which is optimal in this sense.

The average dissimilarity of the i^{th} object to all other objects in the same cluster is given by a(i), and b(i) = min d(i, C), where d(i, C) is the mean dissimilarity of the i^{th} object to all objects in cluster C, minimized over the clusters C to which the object does not belong. The maximum index value indicates the optimal number of groups [
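The silhouette value itself combines a(i) and b(i) as s(i) = (b(i) − a(i)) / max(a(i), b(i)). A minimal sketch, where `silhouette`, `labels`, and `dist` are illustrative names and the toy 1-D data is not from the paper:

```python
def silhouette(i, labels, dist):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)).

    a(i): mean dissimilarity of object i to the other members of its own
    cluster; b(i): smallest mean dissimilarity of i to any other cluster.
    Assumes i's cluster has at least two members."""
    own = labels[i]
    same = [j for j, lab in enumerate(labels) if lab == own and j != i]
    a = sum(dist(i, j) for j in same) / len(same)
    b = min(
        sum(dist(i, j) for j, lab in enumerate(labels) if lab == c)
        / labels.count(c)
        for c in set(labels) - {own}
    )
    return (b - a) / max(a, b)

# Toy 1-D example: two tight groups far apart give silhouettes near 1.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
d = lambda i, j: abs(points[i] - points[j])
```

For point 0 here, a(0) = 1 and b(0) = 10.5, so s(0) ≈ 0.905, indicating a well-placed object.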

The dataset used for the experimental analysis is of mixed type (numeric and categorical). To achieve a better balance between the Euclidean distance of numeric variables and the simple matching coefficient of categorical variables, the optimal value of lambda is estimated by investigating the variables' variance for k-prototype clustering; lambda thus serves as the weighting metric. The same explanation applies in all three cases. The initial crop dataset contains no class labels (presented in
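One common heuristic for this estimate, in the spirit of (but not identical to) clustMixType's lambdaest, takes the ratio of the average numeric variance to the average categorical concentration 1 − Σ p_j², where p_j are the level frequencies. A sketch with hypothetical column data:

```python
def lambda_est(numeric_cols, cat_cols):
    """Heuristic lambda: average numeric variance divided by average
    categorical concentration, measured as 1 - sum of squared level
    frequencies. This mirrors the idea behind clustMixType's lambdaest
    but is not its exact implementation."""
    def num_var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    def cat_var(xs):
        n = len(xs)
        return 1.0 - sum((xs.count(v) / n) ** 2 for v in set(xs))

    v_num = sum(num_var(c) for c in numeric_cols) / len(numeric_cols)
    v_cat = sum(cat_var(c) for c in cat_cols) / len(cat_cols)
    return v_num / v_cat

# Hypothetical columns: one numeric, one categorical.
# variance = 1.25, concentration = 0.5, so lam = 2.5
lam = lambda_est([[1.0, 2.0, 3.0, 4.0]], [["a", "a", "b", "b"]])
```

A lambda above 1 upweights categorical mismatches relative to numeric distance, which is the balancing role described above.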

Along with an explanation of the dataset’s actual contents,

The below

kproto(x, k, lambda = NULL, iter.max = 100, nstart = 1, na.rm = TRUE)

Numeric method

Factor method

We use lambda greater than zero to balance the simple matching coefficient between categorical variables and the Euclidean distance between numerical variables. The order of a variable-specific lambda vector must match the order of the data variables. All variable distances are multiplied by their lambda values.

The output of the function is a list with four components. Using

The results of the k-prototypes clustering for cluster interpretation are shown in

Based on the above cluster interpretations, it is possible to conclude that observations/readings from one cluster differ considerably from those from other clusters. Consider an example as shown in

With the obtained λ value and the above-assigned parameter values, the resulting cluster numbers are tagged as class labels and furnished in

The below

The issues in the dataset are factors that have multiple levels [

With the above-mentioned parameter values, the resulting cluster numbers are prepended with the word class as shown in

The resulting cluster numbers are prepended with the word class using the above-mentioned parameter values, as shown in

From

To reduce computation time when dealing with an overwhelming number of levels, we grouped nearby values within a margin of ±10 while ensuring that the results remain unbiased.
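This ±10 grouping can be sketched as a simple merge of sorted levels into shared range labels; `group_levels` and the example values are illustrative, not the paper's exact procedure:

```python
def group_levels(values, margin=10):
    """Merge nearby numeric levels: consecutive sorted values whose gap
    is at most `margin` share one range label such as '70-90'."""
    ordered = sorted(values)
    groups, current = [], [ordered[0]]
    for v in ordered[1:]:
        if v - current[-1] <= margin:
            current.append(v)
        else:
            groups.append(current)
            current = [v]
    groups.append(current)
    return {v: f"{g[0]}-{g[-1]}" for g in groups for v in g}

mapping = group_levels([70, 80, 90, 150, 160])
```

Here 70, 80, and 90 collapse into one level "70-90" while 150 and 160 form "150-160", cutting five factor levels down to two.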

In

Visualization of k-prototypes clustering results for cluster interpretation of pH and Season is presented in

According to

With the above-assigned parameter values and λ, the resulting cluster numbers are tagged with the word class as class labels, shown in

With the above-assigned parameter values and λ, the resulting cluster numbers are prepended with the word class as class labels, shown in

The observation of the tasks performed on the Case 4.1.3 dataset suggests that all variables are categorical except for two numeric ones. Varying the numeric method's integer value does not significantly affect the formation of clusters. Finally, if we observe

Data mining and analytics involve evaluating and modeling data to draw conclusions and enhance decision-making. Precision agriculture uses advanced data mining tools to advance agriculture. A lack of farm-management knowledge prevents the selection of suitable datasets and crops for particular agro-fields. The cluster analysis performed by the algorithm iteration for the dataset with numeric variables reveals that values in one cluster differ significantly from values in other clusters. In addition, in the same case, the majority of variables are numerical, except for two categorical ones; this means that changing the factor method's integer value has no discernible effect on cluster formation. Case 2 leads to the same conclusion as the first: all variables but two are numeric, and the factor method's integer value is not closely related to cluster formation. The third case, with a dataset of many categorical variables at reduced levels, also reveals a value difference between clusters. The research's overall findings can be divided into three categories: clusters differ significantly between the Kharif and Rabi seasons; the vast majority of variables are categorical rather than numerical; and the integer value of the numerical method has little effect on cluster formation.

The authors would like to thank Bapuji Institute of Engineering and Technology for providing the resources for this research. The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia, for funding this research work through project number IF_2020_NBU_322.

This research work was funded by the Institutional Fund Projects under Grant No. (IFPIP:

The authors declare that they have no conflicts of interest to report regarding the present study.