The demand for deploying neural networks on resource-constrained embedded devices is continuously increasing. Quantization is one of the most promising solutions for reducing computational cost and memory storage on embedded devices. To reduce the complexity and overhead of deploying neural networks on Integer-only hardware, most current quantization methods use a symmetric quantization mapping strategy to quantize a floating-point neural network into an integer network. However, although symmetric quantization is easier to implement, it is sub-optimal when the value range is skewed rather than symmetric, which often comes at the cost of lower accuracy. This paper proposes an activation redistribution-based hybrid asymmetric quantization method for neural networks. The proposed method takes the data distribution into consideration and can resolve the contradiction between quantization accuracy and ease of implementation, balance the trade-off between clipping range and quantization resolution, and thus improve the accuracy of the quantized neural network. The experimental results indicate that the accuracy of the proposed method is 2.02% and 5.52% higher than that of the traditional symmetric quantization method for classification and detection tasks, respectively. The proposed method paves the way for computationally intensive neural network models to be deployed on devices with limited computing resources. Codes will be available on

Artificial intelligence with deep convolutional neural networks has made significant breakthroughs in many fields, which will be widely used in the aerospace field, such as situational awareness [

This paper focuses on Integer-only quantization for inference. Quantization maps the high-precision parameters of a neural network to low-precision values drawn from a finite set, thereby reducing model size and speeding up computation. Because high-precision parameters have a wider dynamic range, the 32-bit floating-point data type is usually used in training. After training, to reduce the size of the network, the 32-bit floating-point neural network is quantized to an 8-bit or even lower-bit integer network.

Quantizing a floating-point network to an integer network requires a properly designed mapping method. Quantization usually causes a loss of accuracy due to information loss. The key problem is how to improve the accuracy of the quantized neural network while preserving hardware efficiency. A good quantization mapping method should resolve the following two questions to improve deployment performance.

The first question is the trade-off between the accuracy of the quantized neural network and the difficulty of deployment and implementation. The simpler the mapping strategy is, the easier and faster the deployment on embedded devices will be, but the greater the loss of accuracy. The more complex the mapping strategy is, the lower the loss of accuracy will be, but deployment on embedded devices becomes more difficult and incurs enormous computational overhead. Symmetric quantization is the most commonly used method because it is easy to implement on embedded devices. However, it works well only for symmetric distributions, whereas most data distributions in neural networks are asymmetric.

The second question is the trade-off between clipping range and quantization resolution, which significantly influences the computation of the quantization parameters. The larger the clipping range is, the lower the data clipping loss will be, but the lower the quantization resolution. The smaller the clipping range is, the higher the quantization resolution will be, but the greater the data clipping loss. Range and quantization resolution affect each other, and there is no established method for balancing them.
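This trade-off can be illustrated numerically. The following is a minimal sketch, assuming a symmetric signed 8-bit quantizer and synthetic Gaussian data; the function name `clip_and_round_errors` and the chosen thresholds are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def clip_and_round_errors(x, t, bits=8):
    """Return (clipping MSE, rounding MSE) for a symmetric quantizer with range [-t, t]."""
    qmax = 2 ** (bits - 1) - 1           # 127 for signed 8-bit
    scale = t / qmax                     # quantization step (resolution)
    clipped = np.clip(x, -t, t)          # values outside the range saturate
    x_hat = np.clip(np.round(x / scale), -qmax, qmax) * scale
    clip_err = float(np.mean((x - clipped) ** 2))      # loss caused by clipping
    round_err = float(np.mean((clipped - x_hat) ** 2)) # loss caused by rounding
    return clip_err, round_err

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 10_000)         # synthetic data (assumption)
for t in (1.0, 2.0, float(np.max(np.abs(x)))):
    clip_err, round_err = clip_and_round_errors(x, t)
    print(f"t={t:.2f}  clipping MSE={clip_err:.2e}  rounding MSE={round_err:.2e}")
```

A small threshold gives a fine step but clips many values, while a threshold equal to the maximum absolute value clips nothing but uses a coarser step.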

We propose an activation redistribution hybrid asymmetric quantization mapping method for Integer-only inference to resolve these two questions. Our contribution can be listed as follows:

Firstly, we propose a hardware-friendly hybrid asymmetric quantization method for Integer-only inference of neural networks, in which the activations use asymmetric quantization and the weights use symmetric quantization. The proposed method avoids additional data-dependent computation, achieves higher accuracy without any computational overhead on embedded accelerators, and resolves the contradiction between the accuracy of the quantized neural network and the ease of deployment and implementation.

Secondly, we introduce an activation redistribution method that computes the quantization parameters with lower quantization error. This method places no restrictions on the data distribution and balances clipping range against quantization resolution.

Most of the existing quantization approaches use asymmetric quantization or symmetric quantization [

Symmetric quantization is a simplified version of the general asymmetric case [

On the one hand, different quantization mapping functions suit different data distributions, and the data distributions of the layers in a neural network are not the same.

On the other hand, the quantization parameters are very important for both asymmetric and symmetric quantization and affect the performance of the quantized neural network. The quantization parameters depend on the clipping range, and the scaling factor divides the given range of real values into a number of partitions. Usually, a series of calibration inputs is fed to the neural network to compute the typical range of activations [

Therefore, the current one-size-fits-all quantization methods do not take the data distribution into consideration, and there is no guiding principle for choosing the most suitable method to compute the clipping range. As a result, the current quantization methods cannot adapt to different neural network structures and perform poorly on tasks with higher accuracy requirements.

We propose an activation redistribution hybrid asymmetric quantization method for Integer-only inference of neural networks that is simple and efficient to implement in hardware. The activations use asymmetric quantization and the weights use symmetric quantization, which avoids additional data-dependent computation. A neural network usually consists of various layers, including the convolutional layer, the relu layer, the leaky-relu layer, the relu6 layer, the sigmoid layer, the tanh layer, and the FC layer. We propose a hybrid asymmetric quantization method for neural networks and the corresponding method to compute the quantization parameters. For the computationally expensive layers, including the convolutional layer and the FC layer, we describe how to quantize them effectively according to the hybrid quantization parameters. For the non-linear layers, such as the relu layer, the leaky-relu layer, the relu6 layer, and the sigmoid layer, we propose a quantization template according to which all non-linear layers can be quantized.

In order to take into account the inference speed, accuracy, and convenience of the deployment for a quantized neural network, we propose a hybrid quantization method with asymmetric activation quantization and symmetric weight quantization. So the quantization mapping functions of the activation and weights are:
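In code, a hybrid mapping of this kind can be sketched as follows. This is a minimal sketch that assumes the standard affine (scale and zero-point) form for the activations and the standard scale-only form for the weights. The function names, the unsigned 8-bit activation range, the signed 8-bit weight range, and the requirement that the clipping range contain 0 are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def quantize_activation(x, t_min, t_max, bits=8):
    """Asymmetric quantization: map [t_min, t_max] onto unsigned integers."""
    qmax = 2 ** bits - 1
    scale = (t_max - t_min) / qmax
    # Integer that represents the real value 0 (assumes t_min <= 0 <= t_max)
    zero_point = int(np.clip(round(-t_min / scale), 0, qmax))
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.int32)
    return q, scale, zero_point

def quantize_weight(w, bits=8):
    """Symmetric quantization: zero point is fixed at 0, only a scale is needed."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(float(np.max(np.abs(w))), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale, zero_point=0):
    """Recover the approximate real values from the integers."""
    return (q.astype(np.float64) - zero_point) * scale

x = np.array([-0.5, 0.0, 1.0, 3.5])      # skewed activation range
q_x, s_x, z_x = quantize_activation(x, t_min=-0.5, t_max=3.5)
w = np.array([-1.0, 0.25, 0.5])
q_w, s_w = quantize_weight(w)
print(q_x, s_x, z_x)
print(q_w, s_w)
```

With the asymmetric mapping, the whole integer range covers the skewed interval from -0.5 to 3.5, whereas a symmetric mapping would have to cover -3.5 to 3.5.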

For the computationally expensive layers, including the convolutional layer and the FC layer, we describe how to quantize them effectively according to the hybrid quantization parameters. The FC layer is quantized in the same way as the convolutional layer.

For the non-linear layers, such as the relu layer, the leaky-relu layer, the relu6 layer, the sigmoid layer, etc., we propose a quantization template. All the non-linear layers can be quantized according to this template. The proposed method can achieve higher accuracy without any execution time overhead on embedded accelerators.

The way to quantize the convolutional layer follows from its computation principle, which is:

The quantization of the convolutional layer follows from its computation principle and the proposed hybrid asymmetric quantization strategy. The activations of the convolutional layer (both input and output) adopt the asymmetric quantization mapping, and the weights adopt the symmetric quantization mapping. The computation principle of the quantized convolutional layer is:
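As a rough illustration of why this hybrid choice is hardware-friendly, the sketch below computes a single output of a convolutional or FC layer (one dot product) with integers only. It assumes the standard affine model x ≈ s_x (q_x − z_x) and w ≈ s_w q_w. Under that model, the input zero point contributes only the term z_x · Σ q_w, which depends on the weights alone and can be folded into a constant offline, so no data-dependent correction is needed at run time. The helper names (`to_fixed_point`, `requantize`, `quantized_dot`) and the 15-bit multiplier width are assumptions made for this example; the paper's own shift, multiply and add parameters may be defined differently.

```python
import numpy as np

def to_fixed_point(m, mul_bits=15, max_shift=31):
    """Approximate a positive real multiplier m by mul * 2**-shift with integer mul."""
    shift = 0
    while shift < max_shift and m * 2 ** (shift + 1) < 2 ** mul_bits:
        shift += 1
    return int(round(m * 2 ** shift)), shift

def requantize(acc, mul, shift, z_y, bits=8):
    """Scale the integer accumulator by mul * 2**-shift (rounded), add the output zero point."""
    rounding = (1 << (shift - 1)) if shift > 0 else 0
    q = ((acc * mul + rounding) >> shift) + z_y
    return max(0, min(2 ** bits - 1, q))

def quantized_dot(q_x, z_x, q_w, q_bias, mul, shift, z_y):
    """Integer-only dot product: asymmetric input, symmetric weights."""
    # Offline constant: bias (in accumulator units) minus the zero-point term
    offset = q_bias - z_x * int(q_w.sum())
    # Online: plain integer multiply-accumulate on the raw quantized values
    acc = int(np.dot(q_w.astype(np.int64), q_x.astype(np.int64))) + offset
    return requantize(acc, mul, shift, z_y)

# Illustrative values (assumptions)
s_x, z_x = 4 / 255, 32
s_w = 0.01
s_y, z_y = 0.02, 10
q_x = np.array([0, 32, 96, 255])
q_w = np.array([127, -64, 32, 10])
q_bias = int(round(0.4 / (s_x * s_w)))   # bias quantized with scale s_x * s_w
mul, shift = to_fixed_point(s_x * s_w / s_y)
print(quantized_dot(q_x, z_x, q_w, q_bias, mul, shift, z_y))
```

If the weights were also quantized asymmetrically, an extra term proportional to the sum of the input values would appear, and that sum would have to be recomputed for every input.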

The method to quantize the convolutional layer can be divided into 5 steps according to

In step 2,

In step 4, the shift parameter S, multiply parameter MUL, and add parameter ADD are computed according to

This section introduces how to quantize the non-linear layers. We propose a quantization template for the non-linear layers. All the non-linear layers can be quantized according to this template. We introduce how to quantize the relu layer, the leaky-relu layer, the relu6 layer, the sigmoid layer, and the tanh layer according to the proposed quantization template.

The computation principle of the nonlinear layers can be expressed as:

The quantization method of the non-linear layers is based on the lookup table. The proposed quantization template to compute the lookup table for the non-linear layers is:

How to use the proposed quantization template to compute the lookup tables for the relu layer, the leaky-relu layer, the relu6 layer, the sigmoid layer and the tanh layer is shown in

Layer type | Computation principle | Proposed quantization method for non-linear layers |
---|---|---|

relu | if |
if |

leaky-relu | if |
if |

relu6 | if |
if |

sigmoid | ||

tanh |
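A lookup-table approach of this kind can be sketched as follows. This is a minimal sketch, assuming 8-bit asymmetric quantization for both the input and the output of the non-linear layer; the function name `build_lut` and the example scales and zero points are illustrative assumptions, and the paper's template may differ in detail. Each of the 256 possible input integers is dequantized, passed through the real-valued function, and requantized once offline, so inference reduces to a table lookup.

```python
import numpy as np

def build_lut(fn, s_in, z_in, s_out, z_out, bits=8):
    """Precompute the quantized output for every possible quantized input."""
    qmax = 2 ** bits - 1
    q_in = np.arange(qmax + 1)                 # all input codes 0..qmax
    x = s_in * (q_in - z_in)                   # dequantize the input codes
    y = fn(x)                                  # real-valued non-linearity
    q_out = np.clip(np.round(y / s_out) + z_out, 0, qmax)
    return q_out.astype(np.uint8)              # uint8 is sufficient for bits <= 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

# Example parameters (assumptions): sigmoid input range about [-4, 4], output [0, 1]
sigmoid_lut = build_lut(sigmoid, s_in=8 / 255, z_in=128, s_out=1 / 255, z_out=0)
relu_lut = build_lut(relu, s_in=4 / 255, z_in=32, s_out=3.5 / 255, z_out=0)

q = np.array([0, 128, 255], dtype=np.uint8)    # quantized activations
print(sigmoid_lut[q], relu_lut[q])             # inference is a single indexing step
```

The same helper covers leaky-relu, relu6 and tanh by passing a different `fn`, which is the sense in which a single template serves all non-linear layers here.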

This section introduces how to compute the hybrid asymmetric quantization parameters. Several images are selected as the calibration set used to compute the quantization parameters of the neural network. The computation is divided into two steps: the first step obtains the clipping thresholds, and the second step computes the quantization parameters from those thresholds. The clipping thresholds significantly influence the computation of the quantization parameters.

The method to compute the clipping thresholds should balance the trade-off between range and quantization resolution. Whether the clipping thresholds are determined by KL divergence, MSE, or another measure between the original real values and the quantized values, part of the dynamic range of the data is wasted, because these methods presume that the data distribution is symmetric, whereas most activation distributions are asymmetric. These methods first take the absolute value of the data when computing the clipping range and then select the thresholds by a certain measure. Taking the absolute value prevents these methods from truly reflecting the data distribution in both the positive and negative ranges. The quantization of non-negative activations, for example, may be less effective because the clipping range includes values that never appear in the input.

In order to adapt to asymmetric activation distributions and balance the trade-off between range and quantization resolution, we propose an activation redistribution method to compute the clipping thresholds. Because this method takes the data distribution into consideration, it achieves lower quantization error. The optimal clipping range for the input is [_{w}. The procedure for computing these clipping thresholds is shown in Algorithm 2 and

As can be seen from the above figure, when the data distribution is not symmetric around 0 (for example, when the negative values are small), the thresholds determined by the KL divergence are not suitable: the threshold selected for the negative region is affected by the positive values and cannot match the actual distribution of the negative values. The proposed method transforms an asymmetric and skewed activation distribution into a Gaussian-like distribution, then obtains the clipping thresholds by KL divergence, and finally obtains the final clipping range by the inverse transformation.
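The three-stage structure described above (transform, symmetric threshold search, inverse transform) can be sketched as follows. This is only an illustration under strong assumptions: the forward transform is a hypothetical placeholder that shifts the data by its median, and the threshold search minimizes mean squared error as a simpler stand-in for the KL divergence criterion. Neither is claimed to be the paper's Algorithm 2, and all function names are invented for this example.

```python
import numpy as np

def symmetric_threshold_mse(x, bits=8, candidates=64):
    """Pick T for the range [-T, T] by minimizing quantization MSE (stand-in for KL search)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return 0.0
    best_t, best_err = max_abs, np.inf
    for t in np.linspace(max_abs / candidates, max_abs, candidates):
        scale = t / qmax
        x_hat = np.clip(np.round(x / scale), -qmax, qmax) * scale
        err = float(np.mean((x - x_hat) ** 2))
        if err < best_err:
            best_t, best_err = float(t), err
    return best_t

def redistributed_clipping_range(x):
    """Transform, search a symmetric threshold, then map the range back."""
    center = float(np.median(x))              # hypothetical forward transform: shift
    t = symmetric_threshold_mse(x - center)   # threshold on the re-centred data
    # Inverse transform, limited to the values that actually occur
    t_min = max(center - t, float(x.min()))
    t_max = min(center + t, float(x.max()))
    return t_min, t_max

rng = np.random.default_rng(0)
act = np.maximum(rng.normal(1.0, 1.0, 10_000), -0.2)   # synthetic skewed activations
print(redistributed_clipping_range(act))
print(symmetric_threshold_mse(act))   # symmetric baseline covers [-T, T] instead
```

On such skewed data the symmetric baseline spends about half of its integer codes on negative values that rarely or never occur, while the shifted range follows the data.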

How to compute the quantization parameters according to the clipping thresholds is as follows. The quantization parameters

The purpose of the experiments is to verify the effectiveness of the proposed hybrid asymmetric Integer-only quantization method.

The neural networks adopted in the experiments are the image classification models and the small target detection model. All of the neural networks are quantized to INT8.

Firstly, the experiments are implemented on the TIANJI NPU3.0 neural network accelerator proposed by Xi’an Microelectronics Technology Institute [

Secondly, we compare the proposed method with PyTorch and NNI [

The dataset for image classification application is ImageNet. ImageNet is an image database organized according to the WordNet hierarchy, in which each node of the hierarchy is depicted by hundreds and thousands of images. The dataset has been instrumental in advancing computer vision and deep learning research.

The dataset for the small target detection application is HRSID. HRSID is a dataset for ship detection, semantic segmentation, and instance segmentation tasks in high-resolution SAR images. The dataset contains 5604 SAR images with resolutions of 0.5, 1, and 3 m.

In order to verify the accuracy of the quantization methods extensively, we evaluate two aspects of quantization error: the quantization error of a particular layer and the overall quantization error of a model.

For the first aspect of quantization error, no single metric can perfectly measure the quantization error, and different metrics reflect it from different perspectives. We adopt three metrics: Manhattan distance, Euclidean distance, and Signal to Noise Ratio. The range of Manhattan distance and Euclidean distance is 0 to +∞, and the range of Signal to Noise Ratio is −∞ to +∞. The smaller the Manhattan distance and the Euclidean distance are, the lower the error will be. The higher the Signal to Noise Ratio is, the lower the error will be.

Manhattan distance (sum of the absolute values of the difference between the original real values and the corresponding floating-point values after quantization):

Euclidean distance (the square root of the sum of the square of the difference between the original real values and the corresponding floating-point values after quantization):

Signal to Noise Ratio:
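The three metrics can be sketched in code as follows. The Manhattan and Euclidean distances follow the verbal definitions given above. The Signal to Noise Ratio is assumed here to be the usual decibel form, 10·log10 of signal energy over error energy, which matches the stated range of −∞ to +∞ but is an assumption about the exact formula used in the paper. The function names are illustrative.

```python
import numpy as np

def manhattan_distance(x, x_hat):
    """Sum of absolute differences between original and dequantized values."""
    return float(np.sum(np.abs(x - x_hat)))

def euclidean_distance(x, x_hat):
    """Square root of the sum of squared differences."""
    return float(np.sqrt(np.sum((x - x_hat) ** 2)))

def snr_db(x, x_hat):
    """Signal to Noise Ratio in dB (assumed definition: signal energy / error energy)."""
    noise = float(np.sum((x - x_hat) ** 2))
    signal = float(np.sum(x ** 2))
    if noise == 0.0:
        return float("inf")                    # no quantization error at all
    if signal == 0.0:
        return float("-inf")                   # all-zero signal with non-zero error
    return 10.0 * float(np.log10(signal / noise))

x = np.array([1.0, 2.0, 3.0])                  # original real values
x_hat = np.array([1.5, 2.0, 2.0])              # values after quantize and dequantize
print(manhattan_distance(x, x_hat), euclidean_distance(x, x_hat), snr_db(x, x_hat))
```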

For the second aspect of quantization error, the evaluation metrics are the accuracy metrics of the model. For the image classification application, we use Top-1 Accuracy (the class with the highest predicted probability must be exactly the expected answer). For the small target detection application, we use mAP (Mean Average Precision), calculated in the same way as in the PASCAL VOC Challenge.

Firstly, we use a traditional symmetric quantization method as a baseline. This method adopts symmetric quantization for both activation and weights, with the clipping range determined by KL divergence. This method is adopted by most embedded neural network accelerators, such as Nvidia’s TensorRT [

Secondly, we compare the proposed method with PyTorch and NNI on PC as further baselines. PyTorch supports INT8 quantization, which, compared with typical FP32 models, allows a 4x reduction in model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster than FP32 computing. PyTorch supports multiple approaches to quantizing a deep learning model. In most cases, the model is trained in FP32 and then converted to INT8. The torch.quantization.observer module in PyTorch integrates three calibration strategies for computing the clipping range: MinMaxObserver, MovingAverageMinMaxObserver, and HistogramObserver. There is no guide on how to choose the most suitable strategy. The easiest way (and the default option in PyTorch) is to directly take the minimum and maximum values with MinMaxObserver. NNI also computes the clipping range by taking the minimum and maximum values.

Classification Application

The models used for image classification are GoogleNet, MobileNetV2, and VGG16. For fair comparison and ease of reproducibility, we use models well trained on the ImageNet dataset. For the image classification application, we test Top-1 Accuracy and FPS (the number of frames processed per second).

The TIANJI NPU3.0 accelerator runs at a frequency of 200 MHz. The resource consumption of TIANJI NPU3.0 on the FPGA is shown in

Resource | Total resources | Consumption | Consumption percentage |
---|---|---|---|
LUT | 274080 | 186174 | 67.93% |
FlipFlops | 548160 | 167977 | 60.64% |
Block RAMs | 912 | 547 | 59.98% |
DSPs | 2520 | 2048 | 81.27% |

The results for the classification application on TIANJI NPU3.0 are shown in

Model | PC accuracy |
Traditional symmetric quantization |
Proposed hybrid asymmetric quantization (INT 8) | ||||
---|---|---|---|---|---|---|---|

Top-1 accuracy | FPS | Power | Top-1 accuracy | FPS | Power | ||

GoogleNet | 67.04% | 65.91% | 36.63 | 14.44 W | 36.63 | 14.44 W | |

MobileNetV2 | 70.24% | 61.54% | 38.02 | 14.48 W | 38.02 | 14.48 W | |

VGG16 | 66.13% | 62.53% | 9.65 | 14.72 W | 9.65 | 14.72 W |

The results for the classification application on MLU220 are shown in

Model | PC accuracy |
Traditional symmetric quantization |
Proposed activation redistribution method (Only convolutional and FC layers are INT8) |
---|---|---|---|

Top-1 accuracy | Top-1 accuracy | ||

GoogleNet | 67.04% | 66.97% | |

MobileNetV2 | 70.24% | 69.88% | |

VGG16 | 66.13% | 65.83% |

Small Target Detection Application

The small target detection task is very challenging because its accuracy is very sensitive to quantization. The model we choose for this task is Yolo-v3 tiny, a typical object detection model that has been widely adopted.

The experimental results measuring the quantization error of convolutional layers and relu layers for the small target detection application on TIANJI NPU3.0 are shown in

Metric | Symmetric quantization | Proposed hybrid asymmetric quantization |
---|---|---|
Manhattan distance | 3.2468 | 0.3617 |
Euclidean distance | 2.1446 | 0.2723 |
Signal to Noise Ratio | 41.1668 | 59.0920 |

Metric | Symmetric quantization | Proposed hybrid asymmetric quantization |
---|---|---|
Manhattan distance | 1177.0 | 823.5 |
Euclidean distance | 4.0216 | 2.8923 |
Signal to Noise Ratio | 38.6908 | 41.0279 |

The speed and accuracy experimental results for the small target detection application on TIANJI NPU3.0 are shown in

Model | PC accuracy |
Traditional symmetric quantization (INT 8) | Proposed hybrid asymmetric quantization (INT 8) | ||||
---|---|---|---|---|---|---|---|

mAP | FPS | Power | mAP | FPS | Power | ||

Yolo-v3 tiny | 90.70% | 80.12% | 17.89 | 14.48 W | 17.89 | 14.48 W |

The results for the small target detection application on MLU220 are shown in

Model | PC accuracy |
Traditional symmetric quantization |
Proposed activation redistribution method |
---|---|---|---|

mAP | mAP | ||

Yolo-v3 tiny | 90.70% | 82.67% |

Classification Application

The models used for image classification to compare with PyTorch and NNI are the same as

The results for the classification application are shown in

Model | PC accuracy | NNI accuracy | PyTorch accuracy (MinMax) | PyTorch accuracy (MovingAverageMinMax) | PyTorch accuracy (Histogram) | Proposed hybrid asymmetric quantization accuracy (proposed activation redistribution) |
---|---|---|---|---|---|---|
GoogleNet | 67.04% | 66.74% | 66.50% | 66.82% | 66.82% | 66.77% |
MobileNetV2 | 70.24% | 66.50% | 66.82% | 66.23% | 66.60% | 66.99% |
VGG16 | 66.13% | 65.20% | 65.22% | 65.12% | 66.20% | 65.31% |

Small Target Detection Application

The model used for small target detection to compare with PyTorch is the same as

The results for small target detection model application are shown in

Model | PC accuracy | NNI accuracy | PyTorch accuracy (MinMax) | PyTorch accuracy (MovingAverageMinMax) | PyTorch accuracy (Histogram) | Proposed hybrid asymmetric quantization accuracy (proposed activation redistribution) |
---|---|---|---|---|---|---|
Yolo-v3 tiny | 90.70% | 85.7% | 83.2% | 86.7% | 85.6% | 86.1% |

We propose an activation redistribution hybrid asymmetric quantization method for Integer-only inference of neural networks. This method is suitable for both symmetric and asymmetric distributions. When the proposed hybrid asymmetric Integer-only quantization method is applied to classification models, it achieves an average accuracy improvement of up to 2.02% compared with the traditional symmetric quantization method. When it is applied to the Yolo-v3 tiny model for detection, the accuracy improvement is 5.52% compared with the traditional symmetric quantization method. Our method therefore allows neural networks to be deployed quickly and easily on resource-constrained embedded devices.

For future work, we believe that making the distribution friendlier to quantization is a promising research direction for further improving quantization performance.

The authors acknowledge the support received from the Qian Xuesen Youth Innovation Foundation of China Aerospace Science and Technology Corporation under grant 2022JY51.

The Qian Xuesen Youth Innovation Foundation from China Aerospace Science and Technology Corporation (Grant Number 2022JY51).

The authors confirm contribution to the paper as follows: study conception and design: Lu Wei, Zhong Ma; data collection and experiment: Chaojie Yang; analysis and interpretation of results: Lu Wei, Chaojie Yang; draft manuscript preparation: Lu Wei, Zhong Ma. All authors reviewed the results and approved the final version of the manuscript.

The data that support the findings of this study are available from the accessible website

The authors declare that they have no conflicts of interest to report regarding the present study.