A substantial part of the Indian economy depends on agriculture, and agriculture in turn depends heavily on rainfall: an adequate amount is a blessing, but excessive or scant rainfall can ruin the farmers’ hard work. In this work, daily rainfall in the Vellore region of Tamil Nadu, India, is forecast for the years 2021 and 2022 using several machine learning algorithms. Feature engineering is performed to generate new features that remove the autocorrelation present in the data. Once the autocorrelation is removed, operations that could otherwise be performed only on regular regression data can be applied to the time-series data. The work uses forecasting techniques such as the AutoRegressive Integrated Moving Average (ARIMA) and exponential smoothing, and the time-series data is further modeled using Long Short Term Memory (LSTM). Regression techniques are then applied to the transformed dataset. The models are benchmarked with several evaluation metrics on a test dataset, where the XGBoost Regression technique outperformed the other models. The uniqueness of this work is that it forecasts the daily rainfall for the years 2021 and 2022 in the Vellore region. This work can be extended in the future to predict rainfall over a larger region based on previously recorded time-series data, which can help farmers and the general public plan accordingly and take precautionary measures.

Agriculture is one of the major pillars of the Indian economy: it contributes $400 billion to the economy and involves 58% of the Indian population [

In this work, time-series analysis of the rainfall dataset of the Vellore region of Tamil Nadu, India, is carried out using multiple machine learning approaches. The dataset comprises the complete rainfall data for the region over a period of ten years, from 2010 to 2019. Furthermore, the work employs feature engineering to improve the overall performance of the predictive modeling [

The major contributions of this work are as follows:

The proposed system forecasts the daily rainfall in the Vellore region for the years 2021 and 2022.

This work uses forecasting methodologies such as the AutoRegressive Integrated Moving Average (ARIMA) and exponential smoothing for short-term and very short-term forecasting of the data, as they are flexible and allow components that are ‘AutoRegressive’ or ‘moving average’.

Furthermore, the time-series data is modeled using Long Short Term Memory (LSTM), and regression techniques are applied after transforming the dataset.

After trying out several models, this work compares their performance using several accuracy metrics on a test dataset. The work considers a decade of rainfall data over a small region; in the future, it can be extended to predict rainfall over a larger region, and multiple models could be combined for better accuracy and efficiency.

Composite Vellore district lies between 12° 15’ and 13° 15’ North latitude and 78° 20’ and 79° 50’ East longitude in Tamil Nadu State. It is bounded on the north by Chittoor District of Andhra Pradesh, on the south by Thiruvannamalai District, on the west by Krishnagiri District, and on the east by Thiruvallur and Kanchipuram districts (

Physiographically, the western parts of the district have hilly terrain, and the eastern side is mostly covered by rocky plains. The district has a population of 3,928,103 as per the 2011 census. The major rivers of the district are the Palar river and the Ponnai river. For most of the year, these rivers are dry and sandy. The Palar river physically splits the district into two halves as it flows from Andhra Pradesh, enters the district at Vaniyambadi Taluk, and passes through Ambur, Gudiyatham, Vellore, Katpadi, Wallajah, and Arcot Taluks. The Palar river has flooded about once every 5 to 7 years; the last floods were reported in 1996 and 2001. The Ponnai river, which flows from Andhra Pradesh, enters the Vellore District at Katpadi Taluk and merges with the Palar river at Wallajah Taluk. In addition, the Malattar, Koudinya Nadi, Goddar, Pambar, Agaram Aaru, Kallar, and Naganadi also flow through the district.

Generally, the temperature and rainfall in the district are moderate. The district records a maximum temperature of 40.2°C and a minimum of 19.5°C. Arakkonam Taluk, in particular, enjoys a moderate climate throughout the year. On the other hand, Vellore, Wallajah, and Gudiyatham Taluks, which are surrounded by hills, are subject to extreme climate conditions, being either very hot during summer or very cold during winter. In the Thirupathur Taluk, the climate is cold during winter but moderate during the other seasons. The district receives rainfall during the southwest and northeast monsoon periods, and the average annual rainfall is around 976 mm. According to a study by the Tamil Nadu State Climate Change Cell, Department of Environment, Government of Tamil Nadu, the annual rainfall for Vellore may reduce by 5.0% by the end of the century.

Under a simple probabilistic view, the chance of rainfall is 50% (either it rains or it does not); in real life, however, several circumstances and other factors affect the outcome. It is possible to predict the occurrence of rainfall by studying the trends and exploring the patterns of previous rainfall over the region. In simple words, in a place where it rains very often it would not be a surprise if it rains the next day; similarly, for a place where it has not rained for a significant period of time, it can be predicted that it would not rain the next day. These are, however, predictions and can, although rarely, prove inaccurate. In this work, various machine learning techniques are applied to the rainfall dataset in order to compare them and decide which method should be preferred for forecasting rainfall precisely. The decision is based on several factors, including their performance and accuracy scores. The architectural diagram of the proposed system is depicted by

The raw rainfall data is collected from various meteorological departments as meteorological forcing variables, which later go through hydrologic modeling for data assimilation. The data retrieved from the remote sensors of the satellites is also assimilated at this stage. The meteorological data and satellite data are then stored in a data warehouse and sent to the feedback repository. The feedback repository is used for rainfall data analysis by considering the metrics and the parameters. Since the metrics and the parameters are dynamic, the analysis is updated from time to time, and the updates are also recorded within the feedback repository. In this work, the meteorological data undergoes feature engineering, and parameters are extracted. The result is then fed to the machine learning model for training and testing. After the training and testing are carried out on the dataset, the rainfall is finally forecast.

The dataset is time-series data comprising the annual rainfall in millimeters for ten years, from 2010 to 2019, around the Composite Vellore region of Tamil Nadu, India. It consists only of the dates and the rainfall measure in millimeters. The dataset has monthly distributions of rainfall over the following places in Vellore: Alangayam, Ambur, Arakkonam, Arcot, Gudiyatham, Kaveripakkam, Melalathur, Sholingur, Tirupattur, Vaniyambadi, Vellore, and Wallajah. To give a holistic view for comparing the rainfall in the study area, the total rainfall in the state of Tamil Nadu for the years 2010 to 2019 is presented in

The machine learning models for forecasting consist of a wide range of techniques, from conventional forecasting methodologies such as ARIMA [

The raw data might not always be consistent: it might contain empty or incorrect values, and sometimes it even contains data that is irrelevant from an analysis perspective. These inconsistencies could lead to unreliable outcomes, including the failure of the entire model that is built [

After the data was refined, it was subjected to exploratory data analysis. The Python library Altair was used for precise statistical visualization. The graphs helped in interpreting the trend of rainfall across different regions over the year for the entire decade [

In any time-series data, the component that changes over a period of time without repeating itself periodically is termed its trend; it can be increasing or decreasing, linear or non-linear. In contrast, seasonality is the component that changes over a span of time and also repeats itself. Together, the trend and the seasonality make the time-series data vary over time. The dataset used in this work is statistically tested using the Augmented Dickey-Fuller (ADF) test [

On performing the Augmented Dickey-Fuller test on the refined dataset with a lag equal to a period of one year, the test statistic came out to be −2.24 and the critical value at the 10% level came out to be −2.56. Since the test statistic is greater than the critical value, the null hypothesis of the ADF test (that the series has a unit root) cannot be rejected. So, the data used in this work is not stationary, but seasonal. The seasonality is removed by differencing the data with an interval equal to the lag.
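The seasonal differencing step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `seasonal_difference` and the toy series are hypothetical, and a lag of 4 stands in for the one-year lag used in this work. The ADF test itself is available in common statistics packages, for example as `adfuller` in statsmodels.

```python
def seasonal_difference(series, lag):
    """Return series[t] - series[t - lag] for every t >= lag.

    Subtracting the value observed one full season earlier removes a
    pattern that repeats with period `lag`.
    """
    if lag <= 0 or lag >= len(series):
        raise ValueError("lag must be between 1 and len(series) - 1")
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Toy series (hypothetical values): a pattern repeating every 4 steps,
# shifted upwards by 1 in each new cycle.
toy_series = [1.0, 5.0, 2.0, 0.0,
              2.0, 6.0, 3.0, 1.0,
              3.0, 7.0, 4.0, 2.0]

differenced = seasonal_difference(toy_series, lag=4)
print(differenced)  # the repeating pattern is gone; only the shift remains
```

For the rainfall data, the same call with `lag=365` would correspond to the yearly differencing described above; the first `lag` observations have no counterpart one season earlier and are dropped.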

The Auto-Regressive Integrated Moving Average (ARIMA) model, at a high level, is an analysis model based on statistics. It is broadly used on time-series data, as it is known for providing better insight into the dataset and it also helps in predicting future trends [

Another term frequently associated with the ARIMA model is the moving average, a statistical measure that is also used for the analysis of stock data. The moving average is widely used for analyzing time-series data by calculating averages based on a moving window [
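As a small illustration of averaging over a moving window, the sketch below computes a simple moving average. The function name `moving_average` and the sample values are hypothetical, and the example only shows the windowed average described here, not the full ARIMA fitting procedure.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running_sum = sum(values[:window])
    averages = [running_sum / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        running_sum += values[i] - values[i - window]
        averages.append(running_sum / window)
    return averages


# Hypothetical daily rainfall values in mm.
rain_mm = [0.0, 3.0, 6.0, 0.0, 9.0, 12.0]
smoothed = moving_average(rain_mm, window=3)
print(smoothed)
```

A wider window gives a smoother curve but reacts more slowly to sudden changes such as a single day of heavy rain.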

Another widely used model for forecasting time-series data is Holt-Winters’ Exponential Smoothing, which captures a trend in the dataset as well as a seasonal variation [

The exponential smoothing methodology forecasts using weighted averages of all the previous values. The weights decay exponentially from the most recent data to the oldest, so the latest data is by default considered much more important than older data [
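The exponentially decaying weights can be seen in a minimal sketch of simple exponential smoothing. This covers only the level component; the Holt-Winters method additionally models trend and seasonality. The smoothing factor `alpha`, the function names, and the sample values are assumptions chosen for illustration.

```python
def exponential_smoothing(values, alpha):
    """Simple exponential smoothing: s[t] = alpha*y[t] + (1 - alpha)*s[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [values[0]]  # initialise the level with the first observation
    for y in values[1:]:
        smoothed.append(alpha * y + (1.0 - alpha) * smoothed[-1])
    return smoothed


def decay_weights(alpha, count):
    """Weight given to the most recent, second most recent, ... observation."""
    return [alpha * (1.0 - alpha) ** k for k in range(count)]


# Hypothetical rainfall values in mm.
observations = [4.0, 8.0, 0.0, 2.0]
levels = exponential_smoothing(observations, alpha=0.5)
weights = decay_weights(alpha=0.5, count=3)
print(levels)   # the last level serves as the one-step-ahead forecast
print(weights)  # each older observation receives half the weight of the next
```

A larger `alpha` makes the forecast follow recent observations more closely, while a smaller one smooths more aggressively.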

The Long Short Term Memory (LSTM) is a successor to the conventional Recurrent Neural Network and is extensively used for time-series data. It provides gates to keep the required information stored, and the fully connected layers provide a smooth flow of error across the gates. Each repeating module has several gates and several cell states, which apply functions to the output from the previous cell and the input [

Regression cannot be applied directly to time-series data; the data first needs to be converted to a regular dataset that is free of autocorrelation. The rainfall dataset used in this work is time-series data that has trends and seasonality, as found in the preceding analysis. The dataset is therefore transformed into a regular dataset with the help of feature engineering, where new features are generated from the old features. The continuous data is converted into categorical values for months and rainfall, as shown in

Month | Mapped value |
---|---|
January | 0 |
February | 0 |
March | 0 |
April | 0 |
May | 2 |
June | 2 |
July | 2 |
August | 2 |
September | 2 |
October | 1 |
November | 1 |
December | 0 |

Rainfall (in mm) | Mapped value |
---|---|
>100 | 5 |
>50 | 4 |
>25 | 3 |
>2 | 2 |
>0 | 1 |
0 | 0 |
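The two mappings tabulated above can be expressed as a short sketch. The names `MONTH_TO_VALUE`, `RAIN_THRESHOLDS`, and `rainfall_category` are hypothetical, and the thresholds are assumed to be applied from the largest to the smallest, so that each amount receives the highest category it qualifies for.

```python
# Month number (1 = January) to mapped value, following the month table.
MONTH_TO_VALUE = {
    1: 0, 2: 0, 3: 0, 4: 0,
    5: 2, 6: 2, 7: 2, 8: 2, 9: 2,
    10: 1, 11: 1,
    12: 0,
}

# (lower bound in mm, mapped value), checked from the largest bound down.
RAIN_THRESHOLDS = [(100, 5), (50, 4), (25, 3), (2, 2), (0, 1)]


def rainfall_category(rain_mm):
    """Map a rainfall amount in mm to its category from the rainfall table."""
    if rain_mm < 0:
        raise ValueError("rainfall cannot be negative")
    for lower_bound, category in RAIN_THRESHOLDS:
        if rain_mm > lower_bound:
            return category
    return 0  # exactly 0 mm


print(MONTH_TO_VALUE[8], rainfall_category(30.0))
```

Because the bounds are strict inequalities, a value that sits exactly on a boundary (for example 25 mm) falls into the lower category.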

For

The Support Vector Regression (SVR) model is an extension of the conventional Support Vector Machines (SVM), which are widely used for solving classification problems [

Linear Regression is another well-known regression model, which is very simple but still extensively used for regression problems. In this model, a linear equation is generated based on the input and output variables, where the coefficients of the line are estimated during the learning process of the model [
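How the coefficients of the line are obtained can be sketched for the single-feature case using ordinary least squares. This is an illustrative closed-form fit with hypothetical names and data, not the multi-feature model trained in this work.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    if variance == 0:
        raise ValueError("all x values are identical")
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(slope, intercept)
```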

XGBoost regression is an enhanced form of the conventional Gradient Boosting methodology. Gradient boosting is an ensemble algorithm, where the umbrella term boosting refers to the sequential addition of models to the ensemble. The algorithm builds multiple decision trees and assigns respective weights to these trees to explicitly give more importance to particular trees for determining the final output [

In this work, the XGBoost model is created with the following hyperparameter values. The base score, which is the initial prediction score of all instances, is set to 0.5, and the booster is set to ‘gbtree’, which uses tree-based models. The step size shrinkage of the model, also known as the learning rate or eta, is set to 0.3 to prevent overfitting, and the minimum loss reduction is set to 0 to make the algorithm less conservative. The importance type is set to gain, no interaction or monotone constraints are used, and the maximum delta step is set to 0, meaning that no constraint is placed on it. The number of parallel trees is set to 1, so a single tree is built in each boosting round.

The maximum depth of the tree is set to 35, with the minimum sum of instance weights needed in a child set to 1; thus, a node is not partitioned further if the split would produce a child whose sum of instance weights is less than 1. The L1 regularization term is kept at 0, while the L2 regularization term on the weights is set to 1. The seed, also known as the random state, is set to 123. In addition, no verbosity is added to the model. Furthermore, the subsample ratios of the columns for each level, node, and tree during construction are set to 1; column subsampling would occur once when each tree is constructed, once when each new depth level is reached in that tree, and once each time a new split is evaluated, and a ratio of 1 means that all columns are retained at each of these stages. Finally, a total of 3 estimators are used with the exact greedy algorithm as the tree method, since the dataset is small.
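For reference, the hyperparameters listed above can be collected in one place. The sketch below assumes the scikit-learn style `XGBRegressor` interface of the xgboost package; the names `XGB_PARAMS` and `build_model` are hypothetical, and the original code of this work may have been organized differently.

```python
# Hyperparameters as described in the text, using XGBRegressor argument names.
XGB_PARAMS = {
    "base_score": 0.5,          # initial prediction score of all instances
    "booster": "gbtree",        # tree-based models
    "learning_rate": 0.3,       # step size shrinkage (eta)
    "gamma": 0,                 # minimum loss reduction needed for a split
    "importance_type": "gain",
    "max_delta_step": 0,        # 0 means no constraint
    "num_parallel_tree": 1,     # one tree per boosting round
    "max_depth": 35,
    "min_child_weight": 1,      # minimum sum of instance weights in a child
    "reg_alpha": 0,             # L1 regularization
    "reg_lambda": 1,            # L2 regularization
    "random_state": 123,
    "verbosity": 0,
    "colsample_bylevel": 1,     # column subsample ratio per depth level
    "colsample_bynode": 1,      # column subsample ratio per split
    "colsample_bytree": 1,      # column subsample ratio per tree
    "n_estimators": 3,
    "tree_method": "exact",     # exact greedy algorithm
}


def build_model():
    """Create the regressor; requires the third-party xgboost package."""
    # Imported lazily so the parameter dictionary can be inspected without it.
    from xgboost import XGBRegressor
    return XGBRegressor(**XGB_PARAMS)
```

Calling `build_model().fit(X_train, y_train)` on the engineered features would then train the model; `X_train` and `y_train` are placeholders for data that is not reproduced here.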

The data is visualized using different plots for better insight. The results of all the algorithms are compared using several widely accepted metrics, and the algorithm that best fits the dataset is selected on the basis of these results. Before the performance of each algorithm is analyzed using the accuracy metrics, the dataset itself is visualized. The plot of the autocorrelation of the magnitude of rainfall over a period of one year, with lags from 0 to 365, is shown in

It can be observed from

The SVR forecast plot of the Alangayam district during August 2019 can be seen in

The Linear Regression forecast plot of the Alangayam district during August 2019 can be seen in

The XGBoost forecast plot of the Alangayam district during August 2019 can be seen in

The Mean Absolute Error (MAE) is a widely used evaluation metric that takes the average of the absolute magnitudes of the errors. The Root Mean Squared Error (RMSE) is another well-known evaluation metric, calculated as the square root of the average of the squared errors. The Relative Absolute Error (RAE) is the total absolute error normalized by dividing it by the total absolute error of a simple predictor. The Root Relative Squared Error (RRSE) is the square root of the total squared error normalized by dividing it by the total squared error of the simple predictor.
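The four metrics can be written out as a minimal sketch. It assumes that the simple predictor always outputs the mean of the actual values, which is the usual convention for RAE and RRSE but is not stated explicitly in this work; the function names and sample values are hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean of the squared errors."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rae(actual, predicted):
    """Total absolute error relative to that of the mean predictor (assumption)."""
    mean_actual = sum(actual) / len(actual)
    baseline = sum(abs(a - mean_actual) for a in actual)
    if baseline == 0:
        raise ValueError("actual values are constant; RAE is undefined")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / baseline


def rrse(actual, predicted):
    """Root of the total squared error relative to that of the mean predictor."""
    mean_actual = sum(actual) / len(actual)
    baseline = sum((a - mean_actual) ** 2 for a in actual)
    if baseline == 0:
        raise ValueError("actual values are constant; RRSE is undefined")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / baseline)


# Hypothetical actual and predicted rainfall values in mm.
y_true = [0.0, 2.0, 4.0, 6.0]
y_pred = [1.0, 2.0, 3.0, 8.0]
print(mae(y_true, y_pred), rmse(y_true, y_pred), rae(y_true, y_pred), rrse(y_true, y_pred))
```

Under this convention, RAE or RRSE values above 1 indicate that a model performs worse than always predicting the mean.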

The tabulation of the different evaluation metrics and magnitudes, for all six approaches, is shown in

Model | MAE | RMSE | RAE | RRSE | Training time (s) | Correlation coefficient |
---|---|---|---|---|---|---|
ARIMA model | 4.14 | 12.02 | 1.22 | 1.25 | 1.11 | 0.247 |
Exponential smoothing | 3.52 | 9.59 | 0.975 | 1.001 | 0.02 | −0.0044 |
LSTM | 2.955 | 12.07 | 0.99 | 1.02 | 380.89 | 0.04 |
SVR | 6.92 | 9.33 | 1.04 | 1.08 | 2.55 | 0.27 |
Linear regression | 3.52 | 8.21 | 0.92 | 0.95 | 0.02 | 0.28 |
XGBoost regression | 1.57 | 5.57 | 0.642 | 0.66 | 0.43 | 0.78 |

The correlation coefficient and total training time plots of the six algorithms are shown in

To plot the data, the MAE, RMSE, RAE and RRSE values are normalized by dividing each column by the maximum value of that column. The normalized values can be observed in

The comparative visualization of the normalized MAE scores of the six algorithms is shown in

The comparative visualization of the normalized RMSE scores of the six algorithms is shown in

Model | MAE | RMSE | RAE | RRSE |
---|---|---|---|---|
ARIMA model | 0.598266 | 0.995857 | 1 | 1 |
Exponential smoothing | 0.508671 | 0.794532 | 0.79918 | 0.8008 |
LSTM | 0.427023 | 1 | 0.811475 | 0.816 |
SVR | 1 | 0.772991 | 0.852459 | 0.864 |
Linear regression | 0.508671 | 0.680199 | 0.754098 | 0.76 |
XGBoost regression | 0.226879 | 0.461475 | 0.52623 | 0.528 |
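The normalized values in this table follow from dividing each metric column by its maximum. The sketch below reproduces that step from the reported raw scores; the names `METRICS` and `normalize_by_column_max` are hypothetical.

```python
# Raw evaluation scores as reported for the six approaches.
METRICS = {
    "ARIMA model":           {"MAE": 4.14,  "RMSE": 12.02, "RAE": 1.22,  "RRSE": 1.25},
    "Exponential smoothing": {"MAE": 3.52,  "RMSE": 9.59,  "RAE": 0.975, "RRSE": 1.001},
    "LSTM":                  {"MAE": 2.955, "RMSE": 12.07, "RAE": 0.99,  "RRSE": 1.02},
    "SVR":                   {"MAE": 6.92,  "RMSE": 9.33,  "RAE": 1.04,  "RRSE": 1.08},
    "Linear regression":     {"MAE": 3.52,  "RMSE": 8.21,  "RAE": 0.92,  "RRSE": 0.95},
    "XGBoost regression":    {"MAE": 1.57,  "RMSE": 5.57,  "RAE": 0.642, "RRSE": 0.66},
}


def normalize_by_column_max(table):
    """Divide every value by the largest value found in its column."""
    columns = list(next(iter(table.values())).keys())
    column_max = {c: max(row[c] for row in table.values()) for c in columns}
    return {
        model: {c: row[c] / column_max[c] for c in columns}
        for model, row in table.items()
    }


normalized = normalize_by_column_max(METRICS)
for model, row in normalized.items():
    print(model, {name: round(value, 6) for name, value in row.items()})
```

After this scaling, the worst-performing model on each metric has the value 1, which makes the four error metrics directly comparable on one axis.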

The normalized RAE metrics scores of the six algorithms are shown in

The normalized RRSE metrics scores of the six algorithms are shown in

Again, from all these visualization plots, it can be concluded that the XGBoost Regression technique outperforms the rest and is therefore the best-suited algorithm for this dataset.

In this work, the rainfall of the Vellore region is forecast using six different algorithms: the ARIMA model, Holt-Winters’ Exponential Smoothing, LSTM, SVR, Linear Regression, and XGBoost Regression. Since regression cannot be applied directly to time-series data, this work also discusses the feature engineering involved in converting the time-series data to regular data. After the models are built, their predictions are individually assessed and plotted for better insight. The models are then evaluated using MAE, RMSE, RAE, and RRSE. The total time taken for training the models and the final correlation coefficient of the models are also considered for evaluation. After implementation, it was observed that XGBoost Regression predicted most accurately compared to the rest. On the basis of the accuracy metrics plots, it was again observed that XGBoost Regression had the minimum overall error based on MAE, RMSE, RAE, and RRSE. Another significant observation from the experimentation was that the LSTM took the longest training time, about six minutes (380.89 s). The correlation coefficient of XGBoost Regression was the highest, while the predictions of Holt-Winters’ Exponential Smoothing showed essentially no correlation (−0.0044). Maximum rainfall is expected in the Vellore region from July to November, and minimum rainfall from February to April. The proposed system forecasts the daily rainfall in the Vellore region for the years 2021 and 2022. Finally, it was concluded that the best-suited model for this dataset is XGBoost Regression. This work can be extended in the future to forecast rainfall data using Deep Learning, which could yield higher accuracy and would help the meteorological department predict with better precision.
Furthermore, the comparison can be widened by adding newer algorithms to the existing ones, or by combining multiple algorithms for better results. Rainfall prediction could help people prepare for disasters and can help the agricultural sector on a broader scale.
