For training present-day Neural Network (NN) models, the standard technique is to use a decaying Learning Rate (LR). Most of these techniques start with a large LR and decay it multiple times over the course of training. Decaying the LR has been shown to improve both generalization and optimization. Other parameters, such as the network’s size, the number of hidden layers, the dropout rate used to avoid overfitting, and the batch size, are chosen largely by heuristics. This work proposes an Adaptive Teaching Learning Based (ATLB) heuristic to identify the optimal hyperparameters for diverse networks. Three Deep Neural Network architectures are considered for classification: Recurrent Neural Networks (RNN), Long Short Term Memory (LSTM), and Bidirectional Long Short Term Memory (BiLSTM). The proposed ATLB is evaluated with various learning rate schedulers: the Cyclical Learning Rate (CLR), Hyperbolic Tangent Decay (HTD), and Toggling between Hyperbolic Tangent Decay and Triangular mode with Restarts (T-HTR) techniques. Experimental results show performance improvements on the 20Newsgroup, Reuters Newswire and IMDB datasets.

Neural Networks (NNs) are models with successive layers of neurons that have been in existence for decades. These NNs can be trained in either an unsupervised or a supervised manner. The most frequently used machine learning technique for both shallow and deep networks is supervised learning Soydaner [

The DNN model Shin et al., [

When training DNNs, it is generally beneficial to decrease the Learning Rate (LR) Yedida et al., [

The choice of algorithm for an NN’s Yang et al., [

Investigation of the hyperparameter search space generally needs a huge number of epochs for training the model with each unique setting of the hyperparameters Wu et al., [

Dropout is a principle in which some of the neurons are not considered during training. These randomly selected neurons are dropped out: during the forward pass, their contribution to the activation of downstream neurons is temporarily removed, and during the backward pass, no weight updates are applied to them. This reduces the time needed for the whole training process.
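As a concrete illustration, here is a minimal NumPy sketch of inverted dropout, the variant used by most frameworks; the rescaling by 1/keep_prob is an implementation detail not stated in the text and is assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, drop_rate=0.5, training=True):
    """Inverted dropout: randomly zero a fraction of activations and
    rescale the survivors so the expected activation is unchanged."""
    if not training or drop_rate == 0.0:
        return activations, np.ones_like(activations, dtype=bool)
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob, mask

x = np.ones((4, 8))
out, mask = dropout_forward(x, drop_rate=0.4)

# During the backward pass the same mask gates the gradient, so the
# dropped neurons receive no weight update.
upstream_grad = np.ones_like(x)
grad = upstream_grad * mask / (1.0 - 0.4)
```

At test time (`training=False`) the layer is an identity, which is why the survivors are rescaled during training.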

Notwithstanding its popularity, the training of NNs is rife with numerous problems Alyafi et al., [

Optimizing the structure of the network and the loss function is an NP-hard (nondeterministic polynomial time) process Vasudevan [

Effective hyperparameter selection results in earlier convergence to the global minimum point on the error surface. This improves task performance while still avoiding issues such as overfitting. In this work, it is proposed to optimize the set of hyperparameters (learning rate, activation function, dropout) for the RNN, LSTM and BiLSTM models by using the Adaptive Teaching Learning Based Optimization (ATLBO) algorithm. The rest of this investigation is organized into the following sections. Section two reviews the related work in the literature. Section three elaborates on the various methods employed in this work. Section four describes the experimental outcomes, and Section five gives the work’s conclusions.

Chen et al., [

Fischetti et al., [

Li et al. [

Yang et al. [

Liu et al. [

Smith [

Hsueh et al., [

A literature survey of NNs, DNNs, decaying LRs, and hyperparameter optimization for deep learning across various domains has been presented here. The observation from the survey is that existing works concentrate on only one hyperparameter of the model to achieve better accuracy; in particular, the learning rate is decayed by using various methods (CLR, HTD and T-HTR) Yu et al., [

This Learning Rate policy’s essence comes from the observation that even though increasing the LR may have a short-term negative effect, it can have a longer-term beneficial effect. In view of this observation, the idea is to allow the LR to vary within a range of values instead of adopting a value that is either stepwise fixed or exponentially decreasing. That is, minimum and maximum boundaries are set, and the learning rate cyclically varies between these bounds Apaydin et al., [
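A sketch of the triangular CLR policy described above; the bounds and step size here are illustrative, not the values used in the experiments:

```python
import math

def triangular_clr(iteration, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular Cyclical Learning Rate: the LR ramps linearly from
    base_lr up to max_lr and back down over each cycle of 2*step_size
    iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# The LR starts at the lower bound, peaks mid-cycle, and returns.
lrs = [triangular_clr(i) for i in range(0, 4001, 1000)]
```

Each cycle restarts from the lower bound, so no decay schedule needs to be hand-picked beyond the two bounds and the step size.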

With due consideration to the loss function topology, it is possible to achieve an intuitive understanding of the CLR methods’ operating principles. Dauphin et al. argued that the difficulty in minimizing the loss arises from saddle points rather than from poor local minima. Saddle points have small gradients, which slow down the learning procedure. Nevertheless, the learning rate can be increased to allow a swifter traversal of the saddle-point plateaus. In all likelihood, the optimum learning rate lies between the bounds, and near-optimal learning rates are utilized throughout the training. For the CLR’s optimization, the root mean square error was employed as the fitness function. The ATLBO algorithm was used to detect the optimal learning rate.

This work proposed a new LR scheduler referred to as the Hyperbolic-Tangent Decay (HTD) scheduler. Compared with the step decay scheduler, HTD was found to have fewer hyperparameters to tune and showed better performance in all the experiments. On the other hand, compared with the cosine scheduler, HTD needs slightly more hyperparameters to tune for higher performance, and it exceeded the performance of cosine schedulers Chen et al., [

This work proposes to utilize hyperbolic tangent functions for Learning Rate scheduling as per the following:

η_{t} = (η_{0} / 2) (1 − tanh(L + (U − L) t / T))

Here, η_{0} is the initial learning rate, t is the current training step, T is the total number of training steps, and L and U are the lower and upper bounds of the tanh argument.

In HTD(L, U), the hyperparameter U affects the final learning rate. As an example, when U = 3, the final learning rate will be approximately (1 − tanh(3))/2 ≈ 0.0025 of the initial learning rate.
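A minimal sketch of an HTD(L, U) scheduler, assuming the hyperbolic-tangent form lr(t) = η_0/2 · (1 − tanh(L + (U − L) t/T)); the default bounds L = −6 and U = 3 are illustrative:

```python
import math

def htd_lr(t, total_steps, init_lr=0.1, lower=-6.0, upper=3.0):
    """Hyperbolic-Tangent Decay HTD(L, U):
    lr(t) = init_lr / 2 * (1 - tanh(L + (U - L) * t / T))."""
    progress = t / total_steps
    return init_lr / 2.0 * (1.0 - math.tanh(lower + (upper - lower) * progress))

# The schedule decays smoothly from ~init_lr toward a small final LR.
schedule = [htd_lr(t, 100) for t in range(101)]
```

Because tanh saturates at both ends, the schedule holds the LR near its initial value early on and flattens out near the small final value, with only the two bounds to tune.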

The T-HTR concept is to toggle the Learning Rate scheduler between hyperbolic tangent decay and triangular mode across the iterations in the batches of an epoch. This reduces the training time and also enhances the model’s learning. Details of earlier epochs (such as the epoch number, learning rate and gradient value) are used to decide the LR scheduler for the initial iterations of the next epoch. The average of all the batches’ LRs is taken and utilized as the next epoch’s initial Learning Rate. The ATLBO algorithm is used to find the optimal learning rate as early as possible.
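A simplified sketch of the toggling idea, under two stated assumptions: the scheduler alternates between HTD and triangular mode every epoch (the text says the choice depends on earlier-epoch details, which are not fully specified here), and the triangular mode's lower bound is taken as a tenth of the current LR:

```python
import math

def htd_lr(it, total, init_lr, lower=-6.0, upper=3.0):
    return init_lr / 2.0 * (1.0 - math.tanh(lower + (upper - lower) * it / total))

def triangular_lr(it, base_lr, max_lr, step_size):
    cycle = math.floor(1 + it / (2 * step_size))
    x = abs(it / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

def t_htr_schedule(epochs, batches_per_epoch, init_lr=0.01):
    """Toggle between HTD and triangular mode across epochs; the mean of
    an epoch's per-batch LRs seeds the next epoch's initial LR."""
    lr, history = init_lr, []
    for epoch in range(epochs):
        use_htd = epoch % 2 == 0               # assumption: simple alternation
        epoch_lrs = []
        for b in range(batches_per_epoch):
            if use_htd:
                epoch_lrs.append(htd_lr(b, batches_per_epoch, lr))
            else:
                epoch_lrs.append(
                    triangular_lr(b, 0.1 * lr, lr, batches_per_epoch // 2))
        history.append(epoch_lrs)
        lr = sum(epoch_lrs) / len(epoch_lrs)   # average -> next initial LR
    return history

history = t_htr_schedule(epochs=4, batches_per_epoch=10)
```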

All evolutionary- and swarm-intelligence-based algorithms are probabilistic and require common controlling parameters, like the population size, number of generations, elite size, etc. In addition to the common control parameters, algorithm-specific control parameters are required. For example, the Genetic Algorithm (GA) uses the mutation rate and crossover rate, and Particle Swarm Optimization (PSO) uses the inertia weight along with social and cognitive parameters. Proper tuning of the algorithm-specific parameters is a crucial factor that affects the performance of these algorithms; improper tuning either increases the computational effort or yields a locally optimal solution. The TLBO algorithm was therefore introduced; it requires only the common control parameters and no algorithm-specific control parameters, so the burden of parameter tuning is comparatively less than in other evolutionary algorithms. Based on the above discussion, the TLBO algorithm steps are Rao et al., [

Step 1: The optimization parameters are set with initial values

Size of the population

Number of iterations

Number of subjects (decision variables)

Step 2: The random population is created based on the size of the population and the number of design variables.

Step 3: The fitness of the potential solutions is evaluated and the population is sorted by fitness value.

Step 4: Teacher Phase: The mean of the population is calculated. This provides the mean of the particular subject.

Step 5: Learner Phase: The solutions are updated by simulating learners gaining knowledge through mutual interaction among themselves.

Step 6: Steps 3 to 5 are repeated till the stopping criterion is met.

Step 7: Stop the process.
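The steps above can be sketched as a compact TLBO implementation (minimization form). The toy 2-D sphere function stands in for a real fitness; the population size and iteration count are illustrative:

```python
import random

random.seed(42)

def tlbo(fitness, bounds, pop_size=20, iterations=50):
    """Plain TLBO: only common control parameters (population size,
    iterations) are needed, no algorithm-specific ones."""
    dim = len(bounds)
    clip = lambda x: [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(iterations):
        scores = [fitness(x) for x in pop]
        teacher = pop[scores.index(min(scores))]
        mean = [sum(x[j] for x in pop) / pop_size for j in range(dim)]
        for k in range(pop_size):
            # Teacher phase: shift each learner toward the teacher and
            # away from the class mean, scaled by the teaching factor T_F.
            tf = random.choice((1, 2))
            cand = clip([pop[k][j] + random.random() * (teacher[j] - tf * mean[j])
                         for j in range(dim)])
            if fitness(cand) < fitness(pop[k]):
                pop[k] = cand
            # Learner phase: learn from a randomly chosen classmate.
            q = random.randrange(pop_size)
            if q == k:
                continue
            if fitness(pop[k]) < fitness(pop[q]):
                step = [pop[k][j] - pop[q][j] for j in range(dim)]
            else:
                step = [pop[q][j] - pop[k][j] for j in range(dim)]
            cand = clip([pop[k][j] + random.random() * step[j] for j in range(dim)])
            if fitness(cand) < fitness(pop[k]):
                pop[k] = cand
    return min(pop, key=fitness)

# Toy usage: minimize the 2-D sphere function.
best = tlbo(lambda x: sum(v * v for v in x), bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

Note that greedy acceptance in both phases means a candidate replaces a learner only if it improves fitness, which is what drives convergence without any mutation or crossover rates.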

Thus, the TLBO algorithm is simple, effective and involves comparatively little computational effort. The ATLBO algorithm is an improved form of the TLBO (Chen et al. 2018) algorithm, made more effective at finding optimized values for the model’s hyperparameters.

Teaching-learning is a critical process in which each individual attempts to learn something from other individuals so as to improve themselves. The algorithm replicates a classroom’s conventional teaching-learning phenomenon by simulating two fundamental learning modes: (i) through the teacher (termed the teacher phase) and (ii) through interaction with other learners (termed the learner phase). Being a population-based algorithm, ATLBO considers a group of students (that is, the learners) as the population, and the different subjects offered to the learners are akin to the optimization problem’s design variables. A learner’s result is analogous to the optimization problem’s fitness value, and the teacher is the best solution in the whole population. Below, the ATLBO algorithm’s operation is explained via the teacher phase and the learner phase Wang et al. [

This phase of the algorithm simulates the students’ (that is, the learners’) learning through the teacher. During this phase, a teacher conveys knowledge to the learners and tries to raise the class’s mean result. Consider that ‘m’ subjects (that is, design variables) are given to ‘n’ learners (that is, the population size, k = 1, 2 . . . n). At any sequential teaching-learning cycle i, M_{j,i} is the learners’ mean result in a particular subject ‘j’ (j = 1, 2 . . . m). A teacher is the most experienced and knowledgeable person in a subject; hence, the teacher in the algorithm is the best learner in the whole population. Suppose that X_{total−kbest,i} is the result of the best learner over all subjects, identified as that cycle’s teacher. Even though the teacher puts maximum effort into raising the entire class’s knowledge level, learners gain knowledge only in accordance with the quality of teaching delivered by the teacher and the quality of the learners present in the class. In view of this fact, the difference between the teacher’s result and the mean result of the learners in each subject is expressed as

Difference_Mean_{j,k,i} = r_{i} (X_{j,kbest,i} − T_{F} M_{j,i})

where X_{j,kbest,i} indicates the teacher’s (that is, the best learner’s) result in subject j, T_{F} indicates the teaching factor, which determines the value of the mean to be modified, and r_{i} indicates a random number in the range [0, 1]. The value of T_{F} can be either 1 or 2 and is decided randomly with equal probability as per the below:

T_{F} = round[1 + rand(0, 1)]

where rand indicates a random number in the range [0, 1]. T_{F} is not a parameter of the algorithm; its value is not supplied as input but is instead determined randomly by the algorithm itself. Therefore, the algorithm does not require the tuning of r_{i} and T_{F} (contrary to the tuning of the GA’s crossover probability and mutation probability, the PSO’s inertia weight and cognitive and social parameters, the ABC’s colony size and limit, and so on). For its operation, the ATLBO algorithm only needs the common control parameters, such as the population size and the number of generations, which are requisite for all population-based optimization algorithms. Therefore, ATLBO is termed an algorithm-specific parameter-less algorithm.

The algorithm’s performance is affected by the values of both r_{i} and T_{F}. The understanding that students gain from the best teacher may be dissimilar and arbitrary, and the learning of students from the teacher is sometimes uncertain when the teaching factor in

The entropy for the particular subject i is:

where l is the lower bound and u is the upper bound and

Based on the

This equation

There is a random selection of two learners, P and Q, such that X′_{total−P,i} ≠ X′_{total−Q,i} (where X′_{total−P,i} and X′_{total−Q,i} are their updated values at the end of the teacher phase). The learners then interact as

X′′_{j,P,i} = X′_{j,P,i} + r_{i} (X′_{j,P,i} − X′_{j,Q,i}), if X′_{total−P,i} > X′_{total−Q,i}

X′′_{j,P,i} = X′_{j,P,i} + r_{i} (X′_{j,Q,i} − X′_{j,P,i}), if X′_{total−Q,i} > X′_{total−P,i}

and X′′_{j,P,i} is accepted if it gives a better function value.

(While the above equations are for maximization problems, the reverse will be true for minimization problems)

The optimization of the hyperparameters (learning rate and dropout) for the RNN, LSTM and BiLSTM models is attained through the teacher phase and learner phase of the ATLBO algorithm; the models are then used for the classification problem. The steps are given below.

Step 1: Initialize the population size, the number of decision variables and the termination criterion

Step 2: Evaluate the mean of each decision variable

Step 3: Estimate the fitness value

Step 4: Select the individual with the best fitness as the teacher

Step 5: Implement the teacher phase and learner phase of the ATLBO algorithm

Step 6: Get the optimal values for the hyperparameters (learning rate and dropout) of the model

Step 7: Train the RNN/LSTM/BiLSTM model with the optimal hyperparameters

Step 8: Test the model with the optimal hyperparameter values

Step 9: Find the accuracy/error

Step 10: End
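The search loop in these steps can be condensed into a short sketch. As an assumption for the sake of a self-contained example, a closed-form surrogate `validation_error` (minimum at learning rate 1e-3 and dropout 0.5) stands in for actually training and validating an RNN/LSTM/BiLSTM model:

```python
import math
import random

random.seed(0)

# Assumption: surrogate in place of "train the model, return validation error".
def validation_error(lr, dropout):
    return (math.log10(lr) + 3.0) ** 2 + (dropout - 0.5) ** 2

bounds = [(1e-5, 1e-1), (0.0, 0.9)]        # (learning rate, dropout)
clip = lambda x: [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]
pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(15)]

for _ in range(40):
    errs = [validation_error(*p) for p in pop]
    teacher = pop[errs.index(min(errs))]
    mean = [sum(p[j] for p in pop) / len(pop) for j in range(2)]
    for k in range(len(pop)):
        # Teacher phase: move toward the best (lr, dropout) pair found so far.
        tf = random.choice((1, 2))
        cand = clip([pop[k][j] + random.random() * (teacher[j] - tf * mean[j])
                     for j in range(2)])
        if validation_error(*cand) < validation_error(*pop[k]):
            pop[k] = cand
        # Learner phase: interact with a random classmate.
        other = pop[random.randrange(len(pop))]
        sign = 1.0 if validation_error(*pop[k]) < validation_error(*other) else -1.0
        cand = clip([pop[k][j] + random.random() * sign * (pop[k][j] - other[j])
                     for j in range(2)])
        if validation_error(*cand) < validation_error(*pop[k]):
            pop[k] = cand

best_lr, best_dropout = min(pop, key=lambda p: validation_error(*p))
```

In the real pipeline, each fitness evaluation is a full train-and-validate run, which is why the number of iterations (the stopping criterion) directly controls the search cost.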

The teacher and learner phases are repeated till the stopping criterion is met; the stopping criterion is the maximum number of iterations considered in the ATLBO algorithm. At the end, the optimal learning rate (for the learning rate scheduler) and dropout are obtained. This optimal learning rate is then used in the CLR, HTD and T-HTR learning rate schedulers.

The learning rate schedulers and dropout are utilized in the RNN, LSTM and BiLSTM models for the training process. The models are trained and tested on the 20Newsgroup, Reuters Newswire and IMDB datasets. The classification accuracy accomplished for the models using the ATLBO algorithm is recorded in the

Accuracy in (%)

Model/LR scheduler | ATLBO | TLBO |
---|---|---|
RNN + CLR | 94.86 | 93.12 |
RNN + HTD | 95.69 | 94.11 |
RNN + T-HTR | 96.23 | 95.35 |
LSTM + CLR | 95.89 | 95.13 |
LSTM + HTD | 96.92 | 95.89 |
LSTM + T-HTR | 97.91 | 96.76 |
BiLSTM + CLR | 96.59 | 95.91 |
BiLSTM + HTD | 97.52 | 96.55 |
BiLSTM + T-HTR | 98.67 | 97.71 |

Accuracy in (%)

Model/LR scheduler | ATLBO | TLBO |
---|---|---|
RNN + CLR | 95.18 | 93.57 |
RNN + HTD | 96.37 | 94.61 |
RNN + T-HTR | 97.11 | 95.89 |
LSTM + CLR | 95.89 | 94.32 |
LSTM + HTD | 96.32 | 95.42 |
LSTM + T-HTR | 97.86 | 96.10 |
BiLSTM + CLR | 96.11 | 93.78 |
BiLSTM + HTD | 97.10 | 96.39 |
BiLSTM + T-HTR | 98.39 | 97.11 |

Accuracy in (%)

Model/LR scheduler | ATLBO | TLBO |
---|---|---|
RNN + CLR | 94.44 | 91.29 |
RNN + HTD | 95.86 | 93.35 |
RNN + T-HTR | 96.84 | 95.49 |
LSTM + CLR | 95.63 | 94.81 |
LSTM + HTD | 96.38 | 95.27 |
LSTM + T-HTR | 97.43 | 95.88 |
BiLSTM + CLR | 96.27 | 95.06 |
BiLSTM + HTD | 97.12 | 95.87 |
BiLSTM + T-HTR | 98.34 | 96.32 |

In the same way, the optimal hyperparameters obtained using the TLBO algorithm are given as input to the model, which is trained and tested on the 20Newsgroup, Reuters Newswire and IMDB datasets. The classification accuracy attained for the models using the TLBO algorithm is documented in the

The classification accuracy statistic by itself does not determine which learning model is the best. There are several measures for evaluating the performance of the various models with the selected optimal features, such as Precision, Recall, and F-Measure.
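These measures are computed directly from the confusion-matrix counts; a minimal per-class sketch:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, Recall and F-Measure for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 3 true positives are predicted as 1, 1, 0.
p, r, f = precision_recall_f1([1, 0, 1, 1, 0], [1, 1, 1, 0, 0])
```

For the multi-class datasets (20Newsgroup, Reuters Newswire) these per-class values would typically be averaged across classes.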

In this section, the 20Newsgroup, Reuters Newswire and IMDB movie review datasets are evaluated. The features are extracted using TF-IDF. The RNN CLR, RNN HTD, RNN T-HTR, LSTM CLR, LSTM HTD, LSTM T-HTR, BiLSTM CLR, BiLSTM HTD and BiLSTM T-HTR methods [

Techniques | 20Newsgroup | Reuters Newswire | IMDB |
---|---|---|---|
RNN + CLR | 0.792 | 0.7982 | 0.8021 |
RNN + HTD | 0.8239 | 0.8296 | 0.8455 |
RNN + T-HTR | 0.855 | 0.8604 | 0.8651 |
LSTM + CLR | 0.8047 | 0.8066 | 0.8134 |
LSTM + HTD | 0.824 | 0.8344 | 0.8488 |
LSTM + T-HTR | 0.8567 | 0.8645 | 0.8654 |
BiLSTM + CLR | 0.8116 | 0.8195 | 0.8235 |
BiLSTM + HTD | 0.8259 | 0.8392 | 0.8521 |
BiLSTM + T-HTR | 0.9039 | 0.9051 | 0.9089 |

Techniques | 20Newsgroup | Reuters Newswire | IMDB |
---|---|---|---|
RNN + CLR | 0.7966 | 0.8006 | 0.8027 |
RNN + HTD | 0.824 | 0.8312 | 0.8487 |
RNN + T-HTR | 0.8567 | 0.8645 | 0.8654 |
LSTM + CLR | 0.8065 | 0.8092 | 0.8155 |
LSTM + HTD | 0.8248 | 0.8386 | 0.8509 |
LSTM + T-HTR | 0.8604 | 0.8651 | 0.8661 |
BiLSTM + CLR | 0.8119 | 0.8225 | 0.8239 |
BiLSTM + HTD | 0.8285 | 0.8432 | 0.8541 |
BiLSTM + T-HTR | 0.8882 | 0.8932 | 0.8981 |

Techniques | 20Newsgroup | Reuters Newswire | IMDB |
---|---|---|---|
RNN + CLR | 0.7943 | 0.7994 | 0.8024 |
RNN + HTD | 0.8239 | 0.8304 | 0.8471 |
RNN + T-HTR | 0.8558 | 0.8624 | 0.8652 |
LSTM + CLR | 0.8056 | 0.8079 | 0.8144 |
LSTM + HTD | 0.8244 | 0.8365 | 0.8498 |
LSTM + T-HTR | 0.8585 | 0.8648 | 0.8657 |
BiLSTM + CLR | 0.8117 | 0.821 | 0.8237 |
BiLSTM + HTD | 0.8272 | 0.8412 | 0.8531 |
BiLSTM + T-HTR | 0.896 | 0.8991 | 0.9035 |

From

From the

From

When the ATLBO algorithm is used, the average training time of the RNN, LSTM and BiLSTM models on the three datasets is shown in the

Model/LR scheduler | 20Newsgroup | Reuters Newswire | IMDB |
---|---|---|---|
RNN + CLR | 13588 s | 290 s | 333 s |
RNN + HTD | 12897 s | 198 s | 288 s |
RNN + T-HTR | 11124 s | 120 s | 189 s |
LSTM + CLR | 12212 s | 133 s | 278 s |
LSTM + HTD | 12411 s | 111 s | 219 s |
LSTM + T-HTR | 10980 s | 88 s | 178 s |
BiLSTM + CLR | 21859 s | 78 s | 120 s |
BiLSTM + HTD | 19214 s | 36 s | 142 s |
BiLSTM + T-HTR | 13798 s | 27 s | 119 s |

Model/LR scheduler | 20Newsgroup | Reuters Newswire | IMDB |
---|---|---|---|
RNN + CLR | 16568 s | 385 s | 458 s |
RNN + HTD | 14841 s | 278 s | 377 s |
RNN + T-HTR | 13325 s | 194 s | 221 s |
LSTM + CLR | 17322 s | 236 s | 278 s |
LSTM + HTD | 13419 s | 182 s | 219 s |
LSTM + T-HTR | 12589 s | 132 s | 178 s |
BiLSTM + CLR | 19698 s | 119 s | 355 s |
BiLSTM + HTD | 17352 s | 89 s | 259 s |
BiLSTM + T-HTR | 14812 s | 63 s | 218 s |

In all possible combinations, the ATLBO algorithm performs better. The results also demonstrate that BiLSTM + T-HTR achieves higher precision on the 20Newsgroup, Reuters Newswire and IMDB movie review datasets than RNN CLR, RNN HTD, RNN T-HTR, LSTM CLR, LSTM HTD, LSTM T-HTR, BiLSTM CLR, and BiLSTM HTD. The average training time is also much reduced for BiLSTM T-HTR.

Optimization of DNNs is primarily an empirical process that needs the manual tuning of various hyperparameters like the dropout rate, weight decay, and LR. Of these, the LR is of prime importance and has been comprehensively researched in recent works. This work has proposed a novel LR computation method to train DNNs with the ATLBO algorithm, which is applied to the process of hyperparameter optimization. Inspired by the process of teaching and learning, this algorithm operates on the effect of a teacher’s influence on the output of learners in a class. The algorithm’s optimal hyperparameters are given as input to the RNN, LSTM and BiLSTM models. The BiLSTM CLR is effective for quickly training a model and also has better classification accuracy. Compared with the BiLSTM CLR, the BiLSTM HTD has superior performance in all the experiments and fewer hyperparameters to tune. Compared with the BiLSTM HTD, the BiLSTM T-HTR performs better in almost all cases and has more flexibility to accomplish better performance. Compared to models using the TLBO algorithm, the average training time for models utilizing the ATLBO approach is much shorter, indicating earlier convergence. Other hyperparameters of the model can be investigated for optimization using the ATLBO algorithm in future work.

The authors received no specific funding for this study.

The authors declare that they have no conflicts of interest to report regarding the present study.