In the field of natural language processing (NLP), pre-trained language models have proliferated in recent years, with question answering systems gaining significant attention. However, as algorithms, data, and computing power advance, models have grown ever larger, with ever more parameters, making training more costly and less efficient. To improve training efficiency and accuracy while reducing model size, this paper proposes PAL-BERT, a first-order pruning model based on ALBERT, designed around the characteristics of question-answering (QA) systems and language models. First, a first-order network pruning method based on the ALBERT model is designed, forming the PAL-BERT model. Then, a parameter optimization strategy for PAL-BERT is formulated, and the Mish function is used as the activation function in place of ReLU to improve performance. Finally, comparison experiments with the traditional deep learning models TextCNN and BiLSTM confirm that PAL-BERT is a pruning-based model compression method that significantly reduces training time and improves training efficiency. Compared with traditional models, PAL-BERT significantly improves performance on NLP tasks.

With the rapid growth of big data, the volume of complex network information is increasing exponentially. As a result, people are increasingly relying on search engines to retrieve relevant information efficiently. However, traditional search engine algorithms are no longer capable of adequately meeting user demands in the face of this overwhelming data volume. Therefore, there is a growing need for more efficient and effective information retrieval methods. The question-answering system (QA) [

For a long time, the progress of deep learning in NLP lagged far behind its performance in image processing. While image recognition algorithms [

Pruning [

The pruning process is shown in

Many experts and scholars have studied various pruning methods. EvoPruneDeepTL [

Traditional zeroth-order network pruning (setting a threshold on the absolute values of the model parameters, retaining those above it and zeroing those below it) is not suitable for transfer learning scenarios, because in such scenarios the model parameters are mainly determined by the original model but must still be fine-tuned and tested on the target task. Pruning directly on the parameter values may therefore discard knowledge from either the source or the target task.

The model used in this paper is based on ALBERT [

This study optimizes the ALBERT model through pruning technology to improve its performance in question and answer tasks. The innovation points are mainly reflected in the following aspects:

ALBERT is introduced as the research basis. It is a more efficient and compact variant of the BERT model, addressing BERT's computational and memory demands through parameter sharing, factorized embedding parameters, and the Sentence Order Prediction task. For low-compute environments, a dedicated training optimization strategy is designed, combining gradient accumulation and 16-bit precision training to improve training efficiency. PAL-BERT is introduced as a first-order pruning model that reduces computational requirements and improves model efficiency by retaining parameters that move far from zero during fine-tuning. Through these innovations, this paper attempts to address some shortcomings of deep learning applications in natural language processing, proposes a more efficient and compact model, and further improves performance through pruning. This series of innovations and improvements makes the work uniquely valuable for optimizing question answering systems.

In our experimental settings, we follow ALBERT's original configuration and use the SQuAD 1.1 and SQuAD 2.0 datasets as pre-training experimental datasets. We select TextCNN and BiLSTM as comparative models. Through experimental comparison and analysis on the CMRC 2018 dataset, PAL-BERT achieves an accuracy of 81.5% on question-answering tasks, demonstrating performance superior to the baseline models. The experimental results verify the effectiveness of the proposed model.

In this study, SQuAD 1.1 and SQuAD 2.0 [

| | SQuAD 1.1 | SQuAD 2.0 |
|---|---|---|
| Training set | | |
| Total samples | 87599 | 130319 |
| Negative samples | 0 | 43498 |
| Total articles | 442 | 442 |
| Articles with negatives | 0 | 285 |
| Development set | | |
| Total samples | 10570 | 11873 |
| Negative samples | 0 | 5945 |
| Total articles | 48 | 35 |
| Articles with negatives | 0 | 35 |


Models have already performed well on the old version of the dataset. The new version therefore introduces manually annotated “unanswerable” questions, increasing the difficulty of the whole task.

For another experimental dataset, we will use the Chinese machine reading comprehension dataset CMRC 2018 [

| | Train | Development | Test | Challenge |
|---|---|---|---|---|
| Number of questions | 10321 | 3351 | 4895 | 504 |
| Average answers per question | 1 | 3 | 3 | 3 |
| Maximum article characters | 962 | 961 | 980 | 916 |
| Maximum question characters | 89 | 56 | 50 | 47 |
| Maximum answer characters | 100 | 85 | 92 | 77 |
| Average article characters | 452 | 469 | 472 | 464 |
| Average question characters | 15 | 15 | 15 | 18 |
| Average answer characters | 17 | 9 | 9 | 19 |

However, there are still some differences between Chinese and English datasets, so CMRC 2018 is also used as a supplementary experimental dataset to explore how the model and preprocessing differ in non-English settings. Each article is accompanied by several related questions, and each question has several manually annotated reference answers; each reference answer is considered correct during evaluation. To ensure question diversity, the dataset covers six common question types plus an “others” category. The statistical table of question types is the same as that in

| Question type | Percentage |
|---|---|
| When | 12.8% |
| Where | 12.3% |
| Who | 8.6% |
| What | 7.8% |
| Why | 5.7% |
| How | 1.2% |
| Others | 51.4% |

ALBERT is a more efficient version of BERT, and it makes improvements in three main areas:

Embedded Layer Restructuring:

In BERT, the word embedding dimension matches the hidden layer dimension. ALBERT argues for a change: the embedding layer primarily holds context-free information, while the hidden layers add contextual information, so the hidden layers should have the higher dimension. To avoid inflating the embedding parameters when increasing the hidden dimension, ALBERT decomposes the embedding matrix into two smaller matrices, projecting from a small embedding dimension up to the hidden dimension.
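Using the configuration reported later in this paper (vocab_size 21128, hidden_size 1024, embedding_size 128), the saving from this factorization can be checked with quick arithmetic. This is a sketch of the parameter count only; bias terms and the exact shapes of the released model are ignored:

```python
# Embedding factorization: a V x H matrix becomes V x E plus E x H.
V = 21128   # vocabulary size (ALBERT_large_zh configuration)
H = 1024    # hidden size
E = 128     # factorized embedding size

bert_style = V * H            # one large embedding matrix, as in BERT
albert_style = V * E + E * H  # two smaller matrices, as in ALBERT

print(bert_style)    # 21635072 embedding parameters
print(albert_style)  # 2835456 embedding parameters
print(round(bert_style / albert_style, 1))  # roughly 7.6x fewer
```

The saving grows with the hidden dimension, which is exactly the regime ALBERT targets.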

Cross-Layer Parameter Sharing:

ALBERT uses a mechanism where all layers share parameters to enhance efficiency. While other methods exist for sharing parameters within specific parts of the model, ALBERT shares all parameters across all layers. This results in smoother transitions between layers, suggesting that parameter sharing improves the stability of the model.

Sentence Order Prediction (SOP) Task:

BERT introduced Next Sentence Prediction (NSP) to improve downstream tasks that use sentence pairs. ALBERT replaces it with an alternative task called Sentence Order Prediction (SOP): two consecutive segments from a single document form a positive sample, and swapping their order creates a negative sample. ALBERT argues that NSP is less effective because it is relatively easy; it conflates topic prediction with coherence prediction, and topic prediction is simple enough that a model can score well on NSP without learning coherence. By drawing both segments from the same document, SOP removes the topic signal and forces the model to learn coherence.

Based on the ALBERT model, this section carries out optimization and pruning to design a model suitable for QA tasks.

In this study, all experimental GPUs are equipped with NVIDIA RTX2060 graphics cards. While these cards offer decent computing power, they may fall short when handling the ALBERT model. Consequently, this section introduces two optimization strategies aimed at reducing the hardware demands for model training. To attain satisfactory training results on platforms with limited computing power, we employ the gradient accumulation method and 16-bit precision training.

The first optimization strategy is gradient accumulation. Take PyTorch as an example. In conventional neural network training, PyTorch computes gradients on each call to 'backward()', and the gradients are not cleared automatically. If they are not cleared manually, they keep accumulating until a “CUDA out of memory” error is raised. Generally, the process is shown in

From the

1) Reset gradient to 0 after the previous batch calculation.

2) Forward propagation, input data in the network to obtain the predicted value.

3) Calculate the loss value according to the predicted value and label.

4) Calculate the parameter gradient by backpropagation through loss.

5) Update the network parameters by the gradient calculated in the previous step.

The gradient accumulation method is to obtain one batch at a time, calculate the gradient once, and continuously accumulate without clearing. As shown in

1) Forward propagation, input data in the network to obtain the predicted value.

2) Calculate the loss value according to the predicted value and label.

3) Normalize the loss (divide it by the number of accumulation steps).

4) Calculate the parameter gradient by back propagation through loss.

5) Repeat steps 1 to 4 to accumulate the gradient instead of resetting.

6) After the gradient accumulation reaches a fixed number of times, update the parameters, and reset the gradient to 0.
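The steps above can be sketched with a toy example. The sketch below uses plain Python and a hand-computed gradient for a one-parameter least-squares problem rather than PyTorch, purely to show the accumulate-then-update pattern; the names `accum_steps` and `grad_sum` are illustrative:

```python
# Toy gradient accumulation: fit w in y = w*x by least squares,
# updating w only after accumulating gradients over several micro-batches.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2*x
w = 0.0
lr = 0.02
accum_steps = 2  # perform one update per 2 micro-batches

grad_sum = 0.0
for epoch in range(200):
    for i, (x, y) in enumerate(data):
        pred = w * x                              # step 1: forward pass
        # step 2-4: loss (pred - y)^2, normalized gradient accumulated
        grad_sum += 2 * (pred - y) * x / accum_steps
        if (i + 1) % accum_steps == 0:
            w -= lr * grad_sum   # step 6: update parameters
            grad_sum = 0.0       # ...and only now reset the gradient

print(round(w, 3))  # converges to 2.0
```

In PyTorch the same pattern amounts to calling 'backward()' on `loss / accum_steps` every micro-batch and calling the optimizer step and 'zero_grad()' only every `accum_steps` iterations.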

Gradient accumulation effectively enlarges the usable GPU memory. For example, with an accumulation count of 6, it can be regarded approximately as a 6× expansion of the effective batch size. Although the result is somewhat worse than an actual 6× memory increase when experimental resources are limited, gradient accumulation remains highly cost-effective.

The second optimization strategy is 16-bit precision training [

Conventionally, models have been trained with 32-bit precision numbers. However, recent research has elucidated that employing 16-bit precision models can yield commendable training results. The foremost rationale for embracing 16-bit precision resides in its capacity to mitigate the memory demands imposed by model parameters and intermediate activations. The computational efficiency of 16-bit arithmetic operations, particularly on GPU hardware, is an instrumental factor. The salient characteristic of 16-bit arithmetic operations is their enhanced parallelizability, resulting in expedited computations. Training deep learning models with 16-bit precision bears the potential to significantly expedite the training process, predominantly by curtailing the temporal overhead incurred in memory transfers and arithmetic operations.

The reduction in precision not only facilitates swifter computations but also leads to a reduction in memory footprint. Storing model parameters and interim activations in a 16-bit format consumes only half the memory compared to 32-bit precision. In numerous cases, models trained with 16-bit precision have demonstrated performance on par with models trained with 32-bit precision. Furthermore, the advantages of 16-bit precision extend to model deployment, particularly in contexts marked by resource constraints, such as edge devices or mobile platforms.

Practically implementing 16-bit precision requires installing the Apex library, a PyTorch extension developed by NVIDIA. The data types used for model weights, activations, and gradients are then switched from 'float32' to 'float16'. To address potential numerical instability, Automatic Mixed Precision (AMP) is employed to dynamically manage precision during the forward and backward passes.
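The memory argument can be illustrated with Python's standard struct module, which supports the IEEE 754 half-precision format (format code 'e'). This is only a storage illustration of the fp32/fp16 trade-off, not AMP itself:

```python
import struct

value = 1 / 3
fp32 = struct.pack('<f', value)   # 32-bit single precision: 4 bytes
fp16 = struct.pack('<e', value)   # 16-bit half precision: 2 bytes

print(len(fp32), len(fp16))  # half the memory per stored value

# Round-tripping shows the precision cost of the smaller format.
back32 = struct.unpack('<f', fp32)[0]
back16 = struct.unpack('<e', fp16)[0]
print(abs(back32 - value))  # on the order of 1e-8
print(abs(back16 - value))  # on the order of 1e-4: only 10 mantissa bits
```

AMP mitigates exactly this loss by keeping a 32-bit master copy of the weights and scaling the loss so that small gradients do not underflow in 16-bit storage.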

Traditional zeroth-order network pruning, which involves setting a threshold for the absolute values of model parameters and retaining those above it while zeroing out those below it, is not suitable for transfer learning scenarios. In transfer learning, model parameters are primarily influenced by the original model but require fine-tuning and testing on the target task. Therefore, directly pruning based on the model parameters themselves may result in the loss of knowledge from either the source task or the target task.

However, first derivative pruning relies on gradients calculated during training to identify and remove less important model parameters. This approach tends to preserve task-specific information better than zero-order pruning. By targeting specific parameters, first-order pruning can produce smaller, more efficient models, making them suitable for deployment in resource-constrained environments.

For the reasons mentioned above, this paper proposes a first-order pruning model for ALBERT, named PAL-BERT (Pruning ALBERT). This model aims to retain parameters that move further from zero during the fine-tuning process, mitigating the loss of knowledge in transfer learning scenarios.

Specifically: for model parameters

In the forward propagation process, the neural network uses the parameters with mask to calculate the output components

In back propagation, using the idea of Straight-Through Estimator [

The corresponding expression for the model parameters is shown in

Combining the above two equations, omitting the mask matrix of 0 and 1, we can get

According to gradient descent, when
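The first-order criterion can be sketched in plain Python. The sketch below follows the movement-pruning formulation, in which importance accumulates −w·∂L/∂w so that weights moving away from zero score high, matching the idea described above; the exact scoring and masking schedule in this paper may differ, and all numeric values here are illustrative:

```python
# First-order (movement-style) pruning sketch on a toy weight vector.
weights = [0.8, -0.05, 0.6, 0.01, -0.9]
# Gradients observed over two fine-tuning steps (illustrative values).
grads_per_step = [
    [-0.1, 0.2, -0.2, 0.3, 0.1],
    [-0.2, 0.1, -0.1, 0.2, 0.2],
]

# Accumulate first-order importance S_i += -w_i * g_i over training steps.
scores = [0.0] * len(weights)
for grads in grads_per_step:
    for i, (w, g) in enumerate(zip(weights, grads)):
        scores[i] += -w * g

# Keep the top-k scores; the rest are zeroed by a 0/1 mask applied in the
# forward pass. In training, the straight-through estimator lets gradients
# flow through this mask as if it were the identity.
k = 3
threshold = sorted(scores, reverse=True)[k - 1]
mask = [1 if s >= threshold else 0 for s in scores]
pruned = [w if m else 0.0 for w, m in zip(weights, mask)]

print(mask)    # which parameters survive
print(pruned)  # masked weight vector
```

Note that the surviving parameters are those the gradient is pushing away from zero, not simply those with the largest magnitude, which is what distinguishes this criterion from zeroth-order pruning.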

To make the established PAL-BERT achieve the best effect in question-and-answer tasks, it is also necessary to select and test its specific internal parameters to find out the optimal scheme. It is divided into two aspects.

1. Selection of different layers. The official ALBERT_large model contains 24 layers of encoder structure, but this structure is like a black box: each layer certainly encodes different semantic information, but it is not known exactly what information each layer contains. Therefore, to find the most effective layer or combination of layers, the different layers of PAL-BERT need to be analyzed.

2. Avoiding overfitting during training. An optimizer with an appropriate learning rate must be selected to train the PAL-BERT model.

The lower layers of ALBERT generally carry more information, while the upper layers may contain less. To accommodate this, different learning rates need to be selected for different layers.

The iterative method for the parameters of each layer can be shown in
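A common way to realize per-layer learning rates is exponential decay by depth: the top layer uses the base rate, and each layer below it is scaled by the attenuation coefficient ε. The sketch below uses the values explored later in the experiments (ε = 0.95, base rate 2.0e−5); the exact iterative formula used in the paper may differ:

```python
# Layer-wise learning rate decay: lr(layer) = base_lr * epsilon**(top - layer).
base_lr = 2.0e-5   # learning rate for the topmost encoder layer
epsilon = 0.95     # attenuation coefficient
num_layers = 24    # ALBERT_large encoder depth

lrs = [base_lr * epsilon ** (num_layers - layer)
       for layer in range(1, num_layers + 1)]

print(f"{lrs[-1]:.2e}")  # layer 24 (top): 2.00e-05
print(f"{lrs[0]:.2e}")   # layer 1 (bottom): smallest rate, so it changes least
```

With ε < 1 the lower layers receive smaller updates, which matches the intent above: layers holding more general information are perturbed less during fine-tuning.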

ReLU is one of the most widely used activation functions. It is computationally efficient and helps alleviate the vanishing gradient problem, making it suitable for training deep neural networks. It is defined as ReLU(x) = max(0, x).

However, the ReLU function still exhibits several drawbacks. Firstly, the output of ReLU is not zero-centered. Secondly, it suffers from the Dead ReLU Problem, where ReLU neurons become inactive in the negative input region, reducing their responsiveness during training. When x < 0, the gradient remains permanently at 0, causing the affected neuron and subsequent neurons to remain unresponsive, and the corresponding parameters are never updated.

In this section, we propose the adoption of the Mish activation function in place of ReLU within the fully connected layers to enhance the model’s performance. Mish offers several advantages over ReLU, making it a promising choice. It lacks an upper limit on positive values, avoiding issues related to saturation and maintaining a smoother gradient. These characteristics contribute to improved gradient flow, mitigate dead neuron problems, and ultimately result in enhanced accuracy and generalization within deep neural networks when compared to ReLU.

The image of Mish function is shown in

Compared to ReLU, Mish allows positive values to reach higher levels without encountering an upper boundary, thus preventing saturation issues associated with limit constraints [
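For reference, Mish is defined as x·tanh(softplus(x)) with softplus(x) = ln(1 + eˣ). A minimal stdlib implementation illustrates the properties discussed above:

```python
import math

def softplus(x: float) -> float:
    # Numerically stable ln(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x: float) -> float:
    # Mish(x) = x * tanh(softplus(x)): smooth, non-monotonic, unbounded above.
    return x * math.tanh(softplus(x))

def relu(x: float) -> float:
    return max(0.0, x)

print(mish(0.0))             # 0.0
print(round(mish(5.0), 2))   # ~5.0: tracks the identity for large positive x
print(round(mish(-1.0), 3))  # ~-0.303: negative inputs still produce signal
print(relu(-1.0))            # 0.0: the "dead" region where gradients vanish
```

Unlike ReLU, Mish is differentiable everywhere and returns small non-zero values for moderately negative inputs, which is what keeps the corresponding neurons trainable.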

The specific experimental parameters are shown in

| Experimental environment | Configuration details |
|---|---|
| CPU | Intel(R) Core(TM) i7-9750H @ 2.60 GHz |
| Memory | 16 GB |
| Graphics card | GeForce RTX 2060 8 GB |
| Operating system | Ubuntu 18.04 |
| Programming language | Python 3.7.5 |
| Deep learning framework | PyTorch 1.3 |

When evaluating the classification performance, the commonly used Accuracy, Recall, and F1 metrics are adopted.

Accuracy is the proportion of all predictions, positive and negative, that are correct. Recall is the proportion of actual positives that are correctly predicted. The F1 score is the harmonic mean of precision and recall.
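Concretely, in terms of true/false positive and negative counts, the metrics are computed as follows (a standard formulation; precision is shown because F1 is its harmonic mean with recall):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int):
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all correct predictions
    precision = tp / (tp + fp)                   # predicted positives that are correct
    recall = tp / (tp + fn)                      # actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Illustrative confusion-matrix counts, not experimental data.
acc, p, r, f1 = classification_metrics(tp=80, fp=15, tn=90, fn=15)
print(round(acc, 3), round(r, 3), round(f1, 3))  # 0.85 0.842 0.842
```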

The ALBERT model used in the experiment in this section is ALBERT_large_zh [

| Parameter | Value |
|---|---|
| Dropout_prob | 0.0 |
| Hidden_act | “GELU” |
| Hidden_size | 1024 |
| Embedding_size | 128 |
| Initializer_range | 0.02 |
| Intermediate_size | 4096 |
| Max_position_embeddings | 512 |
| Num_attention_heads | 16 |
| Num_hidden_layers | 24 |
| Type_vocab_size | 2 |
| Vocab_size | 21128 |

Although layer normalization is used inside ALBERT, the batch size still affects the model's accuracy.

| | Batch = 8 | Batch = 16 | Batch = 32 |
|---|---|---|---|
| FP16 | 0.759 | 0.787 | 0.816 |
| FP32 | 0.762 | 0.789 | — |

Among these, ‘FP16’ indicates the use of mixed-precision computing, while ‘FP32’ indicates full precision. As the table shows, batch sizes of 8, 16, and 32 lead to different final accuracies, with the best performance at a batch size of 32. One possible explanation lies in the gradient update stage: with a batch size of 32, the average loss over 32 samples serves as the loss function, and its gradients drive the parameter updates. If the batch size is too small, training may get stuck in local optima, causing results to oscillate without converging to the global optimum. Therefore, for PAL-BERT training, a larger batch size is preferable, although it demands greater hardware capability.

In the table, when the batch size is set to 32 without using mixed-precision computing, data loss occurs due to GPU memory overflow, rendering computations impossible. However, mixed-precision computing resolves this issue. Experimental results indicate that adopting mixed-precision training significantly enhances GPU training capacity. It is worth noting that, despite NVIDIA’s official statement that its Apex mixed-precision training method does not affect model performance, during PAL-BERT training, there is a slight reduction in final accuracy when using mixed-precision training. Nevertheless, the increase in the trainable batch size resulting from this method more than compensates for its minor adverse effects. Therefore, the use of mixed-precision training remains essential during training.

The actual effect of gradient accumulation can be seen as an effective increase in batch size. The results in the table above were obtained with a gradient accumulation count of 2. With batch = 32, the accuracy for different gradient accumulation counts is shown in

| Gradient accumulation times | Accuracy |
|---|---|
| 1 | 0.794 |
| 2 | 0.816 |
| 3 | 0.816 |
| 4 | 0.813 |

As shown in

| Learning rate | Attenuation coefficient ε | Accuracy |
|---|---|---|
| 2.0e^{−5} | 1.00 | 0.813 |
| 2.0e^{−5} | 0.95 | — |
| 2.0e^{−5} | 0.90 | 0.802 |
| 2.0e^{−5} | 0.85 | 0.783 |
| 2.5e^{−5} | 1.00 | 0.775 |
| 2.5e^{−5} | 0.95 | 0.781 |
| 2.5e^{−5} | 0.90 | 0.792 |
| 2.5e^{−5} | 0.85 | 0.778 |

When the initial learning rate is high, the attenuation should be relatively mild, because the deeper layers learn less and need a relatively low learning rate to fit. The comparison shows that the trained model reaches its highest accuracy with an attenuation coefficient of 0.95 and a learning rate of 2.0e^{−5}.

To compare PAL-BERT’s performance with other models, this section introduces two additional commonly used models: TextCNN [

| Model | Accuracy | Recall | F1 |
|---|---|---|---|
| PAL-BERT | 0.815 | 0.819 | 0.796 |
| TextCNN-G | 0.768 | 0.781 | 0.762 |
| BiLSTM-G | 0.776 | 0.784 | 0.774 |
| TextCNN-W | 0.763 | 0.798 | 0.752 |
| BiLSTM-W | 0.767 | 0.770 | 0.761 |

TextCNN-G and BiLSTM-G denote the use of GloVe word vectors for vectorization, while TextCNN-W and BiLSTM-W denote the use of Word2Vec.

Among the four models combining a traditional neural network with word vectors, BiLSTM-G performs best, but its result is still about 4% worse than PAL-BERT's. This shows that PAL-BERT handles QA tasks better than the traditional models.

The pre-training-based NLP paradigm represented by the BERT model has led to the emergence of a number of models in three main directions: 1) larger capacity, with more training data and parameters, such as RoBERTa [

This study designs a first-order pruning model, PAL-BERT, based on ALBERT and explores the impact of different parameter adjustment strategies on model performance through experiments. The original ALBERT_large_zh model has a size of 64 M; after gradient accumulation, 16-bit precision training, and first-order pruning, the model size is reduced to only 19 M. Although some accuracy is sacrificed, with the F1 score dropping from 0.878 to 0.796, the smaller model can be deployed and trained effectively on low-compute platforms.

A comparative experiment with the traditional deep learning models TextCNN and BiLSTM was designed. The results demonstrate the feasibility of the proposed method and provide a solution for knowledge question-answering models on low-compute platforms. Although PAL-BERT has shown superior performance in experiments, it may still have some potential shortcomings or limitations compared to traditional models.

Pruning methods involve the choice of several hyperparameters. For PAL-BERT, these hyperparameters may require complex tuning depending on the task and the dataset. PAL-BERT's pruning process aims to reduce the model's size, but this is often accompanied by some degree of information loss; compared with traditional models, PAL-BERT may require a trade-off between model efficiency and information retention. Its performance improvement may also depend more heavily on the nature of the specific task and the quality of the fine-tuning dataset, which may leave PAL-BERT's generalization weaker than that of traditional models.

However, the methods proposed in this paper have certain limitations. In our training optimization approach, we opted for gradient accumulation and 16-bit precision training to address insufficient computing power. Nevertheless, in the actual training process, gradient-based optimization remains highly sensitive to the choice of learning rate: setting it too high may lead to divergence, while setting it too low may result in slow convergence or getting trapped in local minima. Determining the optimal learning rate for a given dataset therefore remains an unresolved challenge. Furthermore, for 16-bit precision training, the gain in computational efficiency comes at the cost of reduced precision. The impact of this reduction may surface in specific training tasks, and its practicality in models with more complex word embeddings still requires further investigation.

This paper studies question answering technology based on pre-trained models and proposes a pruning-based model compression method that successfully shortens training time and improves training efficiency. We improve and optimize the ALBERT structure and, adapting it to specific QA tasks, propose a new model that can fully exploit its performance on QA tasks.

Firstly, this paper introduces ALBERT, an improved model based on BERT. Its improvements over BERT cover three main aspects: embedding layer decomposition, cross-layer parameter sharing, and the SOP sentence order prediction task.

Secondly, two optimization strategies for model training are proposed, namely gradient accumulation and 16-bit precision training.

Thirdly, this paper proposes a first-order pruning model, PAL-BERT, for ALBERT. Through comparative experiments with the traditional deep learning models TextCNN and BiLSTM, this paper explores the impact of different batch sizes and mixed precision on model performance. The experimental results show that the effect is best when the gradient is accumulated 2 or 3 times, with little difference between them; given the significant increase in computation time with three accumulations, it is advisable to limit accumulation to two. In addition, comparing the effects of different learning rates on PAL-BERT shows that the trained model reaches its highest accuracy with an attenuation coefficient of 0.95 and a learning rate of 2.0e^{−5}. Among the four models combining a traditional neural network with word vectors, BiLSTM-G performs best, but its result is still about 4% worse than PAL-BERT's. This shows that PAL-BERT significantly improves the performance of natural language processing tasks compared with traditional deep learning models.

Not applicable.

Supported by Sichuan Science and Technology Program (2021YFQ0003, 2023YFSY0026, 2023YFH0004).

The authors confirm contribution to the paper as follows: study conception and design: Wenfeng Zheng; data collection: Siyu Lu, Ruiyang Wang; software: Zhuohang Cai; analysis and interpretation of results: Wenfeng Zheng, Lei Wang; draft manuscript preparation: Wenfeng Zheng, Siyu Lu, Lirong Yin. All authors reviewed the results and approved the final version of the manuscript.

SQuAD can be obtained from:

The authors declare that they have no conflicts of interest to report regarding the present study.