Convolutional Neural Network (ConNN) implementations on Field Programmable Gate Arrays (FPGAs) have attracted attention as the computational capabilities of FPGAs have improved. Model compression is required to enable ConNN deployment on resource-constrained FPGA devices. Logarithmic quantization is an efficient compression method that can compress a model to very low bit-width without significant deterioration in performance. It is also hardware-friendly, since multiplications can be implemented with bitwise operations. However, the logarithmic scheme suffers from low resolution at high inputs due to its exponential properties. We therefore propose a modified logarithmic quantization method with fine resolution to compress a neural network model. In experiments, quantized models achieve a negligible loss of accuracy without any retraining step. In addition, we propose a resource-efficient hardware accelerator for running ConNN inference. Our design completely replaces multipliers with bit shifters and adders. Throughput is measured in Giga Operations Per Second (GOP/s), and hardware utilization efficiency is expressed as GOP/s per Digital Signal Processing (DSP) block and per thousand Look-up Tables (kLUTs). The results show that the accelerator achieves a resource efficiency of 9.38 GOP/s/DSP and 3.33 GOP/s/kLUTs.

Edge devices such as low-end FPGAs have limited resources, but ConNN models require large numbers of multipliers and extensive storage [

However, compressing a model from 32 bits to lower precision causes accuracy degradation. The quantized model then needs to be retrained to recover accuracy close to the original. But retraining can be time-consuming and requires additional computation. Moreover, the original training datasets are sometimes inaccessible due to privacy or ownership constraints, for example, data in healthcare systems or user data collected by companies.

A post-training scheme has been introduced to address this issue by quantizing the model without retraining [

For this reason, we investigate a modified logarithmic quantization to overcome the issues mentioned above. We then propose an efficient hardware architecture for running model inference on resource-constrained devices. The main contributions of this work are as follows:

We present a modified logarithmic quantization for compressing a ConNN model to very low bit-width. We use different quantization methods for weights and activations to fit their different distributions. Our method converts models to lower bit-width without retraining.

We also propose a resource-efficient accelerator for running ConNN inference on the Zynq XC7Z020. The proposed processor uses only adders and shifters to compute convolutions. Experimental results show that our work achieves a resource efficiency of 9.38 GOP/s/DSP and 3.33 GOP/s/kLUTs.

This section reviews related work in the literature. Quantization has been proposed in many works to address the computational complexity and high memory requirements of neural network models by compressing an original 32-bit model to lower precision. Logarithmic quantization is an efficient method that can compress models to very low bit-width without significant deterioration in performance. Logarithmic quantization was proposed in [

To improve the quantization levels, the logarithmic base-

The logarithmic scheme is a non-uniform quantization with low resolution at high-level inputs. To address this issue [

Several works have shown that the weights in the layers of a neural network follow a bell-shaped, long-tailed distribution: a large number of weights lie around the mean, and a few weights have high absolute values. This motivates the use of logarithmic quantization, which places a high number of quantization levels around the mean; in other words, it provides high resolution at low-level inputs. Due to its unequal quantization levels, however, the logarithmic scheme has low resolution at high inputs. In the logarithmic scheme, quantized values are represented by integers in the log domain, so the step size increases exponentially from low to high-level inputs. Although the weight distribution shows that only a few weights have high values, those weights can represent strong connections and have a high impact on accuracy.

To address this unbalanced quantization-levels issue, we propose a modified logarithmic quantization that uses real numbers for the quantization levels in the log domain. This reduces the spacing between levels in the log domain and therefore provides a greater number of quantization levels at high inputs. Since the logarithmic method already has high resolution at low inputs, we continue to use integers for the quantization levels at low-level inputs. In this way, our method provides high resolution across all input levels.

This approach is mostly used in previous work to map 32-bit floating-point values to lower bit-widths such as binary, ternary, and fixed-point formats. It is a uniform quantization in which the quantization levels are equally spaced. The quantization parameters are the full scale range (FSR) and the word length (W). The fixed-point quantization function (

where the clip function (Clip(.)) is defined as:

Fixed-point format is denoted by ap_fixed(W, FL), where W is the total bit-width, including the integer bits (IL) and the fractional bits (FL). To quantize 32-bit floating-point values to fixed-point format using linear quantization, the parameters are defined as:

The quality of the quantization depends on the bit-width. Increasing the bit-width linearly increases the number of quantization levels and gives a higher resolution, but it requires more memory and makes computation more expensive. Conversely, a lower bit-width takes less memory at the cost of accuracy degradation.
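As an illustration, the linear (fixed-point) mapping described above can be sketched as follows. The function and parameter names (`fsr`, `w_bits`) are ours, and the exact rounding convention may differ from the paper's formula:

```python
import numpy as np

def linear_quantize(x, fsr, w_bits):
    """Uniform quantization sketch: clip to [-fsr, fsr], then round to the
    nearest of the equally spaced levels a W-bit signed value can
    represent (step size = FSR / 2**(W-1))."""
    step = fsr / (2 ** (w_bits - 1))
    clipped = np.clip(x, -fsr, fsr)        # the Clip(.) function
    return np.round(clipped / step) * step

weights = np.array([0.7, -0.31, 1.9, 0.02])
q = linear_quantize(weights, fsr=1.0, w_bits=4)  # -> [0.75, -0.25, 1.0, 0.0]
```

Note how the value 1.9 is saturated to the FSR before rounding, while 0.02 collapses to zero, illustrating the resolution trade-off discussed above.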

This method converts a floating-point number to a lower bit-width by mapping it into the log domain. Logarithmic quantization is a non-uniform quantization in which the quantization levels are unequally spaced and a logarithm is used to represent the quantized value. The logarithmic quantization parameters are the full scale range and the total bit-width (W). With W-bit representation, the logarithmic quantization function (

where

Unlike fixed-point quantization, the logarithmic scheme gives a high number of quantization levels (high resolution) around the mean and lower resolution elsewhere. Since the weights in a layer of a neural network can be considered bell-shaped, logarithmic quantization performs better than the fixed-point scheme at very low bit-widths. Moreover, logarithmic quantization is hardware-friendly: multiplications can be replaced with bit-shift operations. For instance, given quantized weights (

where
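The naive logarithmic scheme and its shift-based multiplication can be illustrated with a small sketch. This is an assumed reading (sign plus a rounded, clipped integer exponent); the names are ours:

```python
import numpy as np

def log_quantize(x, w_bits):
    """Naive logarithmic quantization sketch: store the sign and a clipped
    integer exponent e = round(log2|x|); the dequantized value is ±2**e."""
    sign = np.sign(x)
    e = np.round(np.log2(np.abs(x) + 1e-12))
    e = np.clip(e, -(2 ** (w_bits - 1)), 0)  # exponent range set by bit-width
    return sign, e.astype(int)

# Multiplying an activation by a log-quantized weight is a bit shift:
# a * (±2**e) equals ±(a shifted by e), so no hardware multiplier is needed.
sign, e = log_quantize(np.array([0.24]), w_bits=4)   # 0.24 -> -2**-2
a = 10                                               # an integer activation
product = int(sign[0]) * (a * 2.0 ** e[0])           # a >> 2 in hardware
```

The exponential step size is visible here: every value between roughly 0.177 and 0.354 maps to the same level 2**-2.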

Applying logarithmic quantization directly to weights can result in a loss of accuracy. The weights in the layers of a neural network have different effects on accuracy: larger absolute values affect accuracy more than smaller ones. The disadvantage of the logarithmic scheme is its use of the integer format in the log domain. Because of the exponential property, the number of quantization levels decreases for high-value inputs; in other words, the logarithmic scheme does not work well when the absolute input is large. To address this quantization-levels issue, we take inspiration from this work [

where

The quantized weight

where

where M =

The idea of Qset is to represent the quantized value using real numbers or integers. Absolute weights greater than s take real numbers as their representation levels; otherwise, the quantized value is represented by integers. In the set p1, the interval starts from 0 to

The parameter S is used to define the range of the set p1. In our experiments, we set S to 0.01 to provide high-resolution quantization levels for absolute weights between 0.01 and 1. The parameter R is a positive real number used to set the step size of the representation levels. A larger R provides a wider representation range but reduces the resolution. Parameter R must be greater than
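The construction of Qset can be sketched roughly as below. The exact set definition is abbreviated in the text, so this is only an assumed reading: real-valued exponents spaced by R cover large absolute weights down to S, and integer exponents cover the smaller weights.

```python
import numpy as np

def build_qset(S=0.01, R=0.25):
    """Assumed Qset sketch: fine real-number exponents (step R) for
    absolute weights in [S, 1], integer exponents below that region."""
    # Fine levels: exponents from log2(1) = 0 down toward log2(S), step R.
    fine = np.arange(0.0, np.log2(S), -R)
    # Coarse levels: a few integer exponents below the fine region.
    start = np.floor(np.log2(S))
    coarse = np.arange(start, start - 4, -1.0)
    return np.concatenate([fine, coarse])

def quantize_weight(w, qset):
    """Map |w| to the nearest exponent in Qset; keep the sign."""
    e = qset[np.argmin(np.abs(np.log2(abs(w)) - qset))]
    return np.sign(w) * 2.0 ** e
```

With R = 0.25, a weight of 0.7 maps to 2**-0.5 (about 0.707) instead of the naive logarithmic choices 2**0 or 2**-1, which is the finer resolution at high inputs that the method aims for.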

Applying logarithmic quantization to activations is challenging because the activations of each layer have a differently shaped distribution. Using quantization with Qset may lead to suboptimal results because the activation values are not constant and depend on the input, so we cannot determine the optimal parameter R for the range of Qset. For this reason, we use another method for quantizing activations. We adopt this [

where

The quantization step

This section presents the performance of the quantization methods. The goal of quantization is to compress values into a smaller bit-width with less distortion. We use the signal-to-quantization-noise ratio (SQNR) to evaluate each quantization method. Let

where
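Assuming the standard definition (signal power over quantization-noise power, expressed in dB), SQNR can be computed as:

```python
import numpy as np

def sqnr_db(x, x_q):
    """SQNR in dB: ratio of signal power to the power of the
    quantization error x - x_q (assumed standard definition)."""
    noise = x - x_q
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))

x = np.array([0.51, -0.26])      # original values
x_q = np.array([0.5, -0.25])     # their quantized counterparts
score = sqnr_db(x, x_q)          # roughly 32 dB: small error, high SQNR
```

A higher SQNR means the quantizer preserved more of the signal; this is the metric behind the per-model numbers reported in the tables below.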

The performance of quantization for activations is shown in

We evaluate the proposed quantization method on different models and datasets. To show the improvement, we apply both the naive logarithmic method and the proposed logarithmic quantization to the weights. We also implement naive logarithmic and fixed-logarithmic quantization for the activations.

For weight compression, using LogQ(w) with Flog(a) can increase almost 2

| Model | W-bits | Log(w, a) | LogQ(w), Log(a) | LogQ(w), Flog(a) |
|---|---|---|---|---|
| ResNet-18 | 6 | 36.29 | 50.92 | 69.44 |
| | 5 | 36.29 | 49.65 | 68.57 |
| | 4 | 32.46 | 40.79 | 60.52 |
| ResNet-34 | 6 | 33.52 | 49.95 | 71.11 |
| | 5 | 33.52 | 47 | 69.96 |
| | 4 | 29.8 | 37.95 | 59.97 |
| AlexNet | 6 | 44.7 | 54.77 | 56.4 |
| | 5 | 44.6 | 54.02 | 56.4 |
| | 4 | 40.81 | 50.83 | 53.34 |
| Tiny YOLOv2* | 6 | 32.4 | 41.07 | 45.73 |
| | 5 | 32.4 | 40.93 | 44.75 |
| | 4 | 32.38 | 37.92 | 40.23 |

Note: *mAP score of models on VOC2007.

The results of quantization for activations are shown in

| Model | A-bits | Log(w, a) | LogQ(w), Log(a) | LogQ(w), Flog(a) |
|---|---|---|---|---|
| ResNet-18 | 6 | 36.29 | 50.92 | 69.44 |
| | 5 | 36.29 | 50.92 | 69.3 |
| | 4 | 32.46 | 47.2 | 67.44 |
| ResNet-34 | 6 | 33.52 | 49.95 | 71.11 |
| | 5 | 33.52 | 49.95 | 70.67 |
| | 4 | 29.8 | 47.88 | 69.95 |
| AlexNet | 6 | 44.7 | 54.77 | 56.4 |
| | 5 | 44.6 | 54.77 | 56.25 |
| | 4 | 40.81 | 52.84 | 55.04 |
| Tiny YOLOv2* | 6 | 32.4 | 41.07 | 45.73 |
| | 5 | 32.4 | 41.07 | 45.07 |
| | 4 | 32.39 | 41.01 | 45.04 |

Note: *mAP score of models on VOC2007.

| Model | W-bits | A-bits | Top-1 | Retraining |
|---|---|---|---|---|
| ACIQ [ | 8 | 4 | 65.80 | No |
| ZeroQ [ | 2–8 | 4 | 69.05 | No |
| MSQ [ | 4 | 4 | 70.27 | Yes |
| (This work) | 6 | 4 | 67.44 | No |
| MXQN [ | 8 | 8 | 67.61 | No |
| PACT [ | 5 | 5 | 69.8 | Yes |
| (This work) | 5 | 5 | 68.5 | No |

This section presents the hardware accelerator for running quantized neural network inference on FPGA. The ZedBoard Zynq-7000 is used as the edge device for the hardware experiments. This board includes a dual-core ARM Cortex-A9 and a Zynq XC7Z020 FPGA chip. The trained weights are stored on an SD card, and the accelerator is implemented in the programmable logic (PL) part.

Zynq Processor: We use this processor to control the workflow of ConNN inference. To start inference, the processor initializes the necessary peripherals. It then loads the trained weights from the SD card into DRAM using the DDR control module. When the framework requires a convolutional computation, the Zynq processor sends the configuration parameters and the DRAM addresses of the weights and activations, and then enables the accelerator to perform the convolution.

Logarithmic Accelerator: This is the hardware core for running neural network inference. The accelerator consists of processing elements (PEs) for multiplication and accumulation. There are multiple PEs in a single accelerator, and all PEs run simultaneously to support parallel computation. The details of this module are described in the next section.

Internal memory: Typically, an FPGA does not have enough on-chip memory to store all parameters. Therefore, we design memory buffers for the weights and activations. These buffers are built from the 36 Kb block RAM (36 Kb BRAM) resources in the FPGA.

Advanced eXtensible Interface (AXI): The AXI is used for sending configuration parameters from the processor to the accelerator. The Memory AXI (M_AXI) allows the accelerator to access external memory, so the accelerator can issue load/store instructions.

We implement the convolution computation based on General Matrix Multiplication (GEMM). The activations and weights are stored in DRAM by the Zynq processor. The runtime workflow starts by loading the input activations and the weights of the first layer into the accelerator's on-chip memory. The processing elements then compute the convolution and store the output activations in the output buffer. After completing the convolution, the accelerator writes the output activations back to DRAM and starts the process again for the next layer.
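The GEMM formulation of convolution can be sketched in software as follows (single channel, unit stride, no padding). The `im2col` helper is the usual patch-unrolling trick, not code from the paper:

```python
import numpy as np

def im2col(x, k):
    """Unroll every k x k patch of a single-channel feature map into a
    column, so convolution becomes one matrix multiplication (GEMM)."""
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1
    cols = np.empty((k * k, oh * ow))
    idx = 0
    for i in range(oh):
        for j in range(ow):
            cols[:, idx] = x[i:i + k, j:j + k].ravel()
            idx += 1
    return cols

x = np.arange(16.0).reshape(4, 4)   # a 4x4 input feature map
w = np.ones((1, 9))                 # one 3x3 filter, flattened to a row
y = w @ im2col(x, 3)                # GEMM form of the convolution
```

Each output element is then one row-by-column dot product, which is exactly the multiply-accumulate pattern the PEs implement.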

In general, a processor for computing convolutions consists of a set of multipliers and adders. In this work, the quantized value is expressed as a power of 2. Therefore, we can replace the multipliers with bit-shift operators in the convolution. We simplify the multiplication of quantized activations and weights in the log domain. The algorithm can be expressed as:

where

In the hardware-level implementation, we avoid using multipliers by storing all multiplications

Since the weights are stored as indexes into Qset, we use a look-up table to convert each index to the integer part of the quantized weight (
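A rough software model of this lookup-plus-shift datapath is given below. The LUT contents, index width, and function names are illustrative, not the paper's actual values:

```python
# Hypothetical PE datapath sketch: weights arrive as Qset indexes, a
# small LUT maps each index to an exponent (shift amount), and the
# "multiplication" is a bit shift followed by accumulation.
SHIFT_LUT = {0: 0, 1: -1, 2: -2, 3: -3}  # index -> exponent (example values)

def pe_mac(activations, weight_indexes, signs):
    """Shift-and-add MAC: acc += sign * (a * 2**e), no multiplier used."""
    acc = 0.0
    for a, idx, s in zip(activations, weight_indexes, signs):
        e = SHIFT_LUT[idx]
        acc += s * (a * 2.0 ** e)  # in hardware: a shifted right by -e
    return acc

out = pe_mac([8, 4, 2], [1, 0, 2], [+1, -1, +1])  # 4 - 4 + 0.5 = 0.5
```

Since the LUT is tiny (one entry per Qset index), it maps naturally onto FPGA LUT fabric, which is why the design consumes LUTs instead of DSP blocks.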

To accelerate the computation, we implement multiple PEs to support parallel computation. The maximum number of PEs is

FPGA devices consist of Digital Signal Processing (DSP) blocks, Look-up Tables (LUTs), Flip-Flops (FFs), and Block RAM (BRAM). These resources, particularly DSPs and LUTs, are very limited on edge FPGA devices. For this reason, we constrain resource usage by setting the amount of storage available. This defines the structure of the accelerator, as mentioned in Section 7.2. We instantiate the structure of the accelerators as:

The parameter size is another important factor in mapping a ConNN to hardware. A larger bit-width requires more resources, but bit-width and accuracy are a trade-off. The quantized-model evaluation shows that weight quantization has a high impact on accuracy. Therefore, we fix the weight bit-width at 6 bits and vary the bit-width of the activations.

| Accelerator | Bit-width (w,a) | #BRAM 18K | #FFs | #LUTs | #DSP |
|---|---|---|---|---|---|
| Fixed-point | 8,8 | 56 (20%) | 7732 (7.27%) | 7444 (13.99%) | 80 (36.36%) |
| Logarithmic | 6,6 | 64 (22.5%) | 8140 (7.65%) | 11248 (21.14%) | 4 (1.8%) |
| | 6,5 | 64 (22.5%) | 8139 (7.65%) | 11234 (21.11%) | 4 (1.8%) |
| | 6,4 | 64 (22.5%) | 7077 (7.24%) | 10497 (19.73%) | 4 (1.8%) |

Using a multiplier requires DSPs and LUTs, whereas a bit shifter requires only LUTs. Therefore, the proposed hardware significantly reduces the number of DSPs by a factor of 20

This section presents a performance comparison with prior designs. In this implementation, the Zynq XC7Z020 FPGA is used at a 150 MHz operating frequency. We run different neural network models on the device to evaluate the proposed accelerator. Our goal is to run complex neural networks on a resource-constrained FPGA device, so we concentrate on the efficient use of resources. Throughput is also a key factor for running models in a real-world environment. We capture the actual execution time of the accelerator and calculate throughput in Giga Operations Per Second (GOP/s). However, throughput and resource usage are a trade-off. For this reason, we evaluate the performance of the accelerator using resource-efficiency metrics: the ratio of GOP/s to the number of DSPs and of GOP/s to the number of LUTs.

The performance of the proposed accelerator and its comparison with other prior works are presented in

| Model | ResNet-18 | ResNet-18 | ResNet-18 | AlexNet | AlexNet | Tiny YOLOv2 | Tiny YOLOv2 |
|---|---|---|---|---|---|---|---|
| Design | [ | [ | Our | [ | Our | [ | Our |
| Device | XC7Z020 | ZCU102 | XC7Z020 | XC7Z045 | XC7Z020 | XC7Z020 | XC7Z020 |
| Bit-width | w4/a4 | 16 bits | w6/a6 | w8/a8 | w6/a6 | w8/a4 | w6/a6 |
| Accuracy | 70.27 | – | 69.44 | 54.6 | 56.4 | – | 45.73 |
| #BRAM | 112 | 912 | 64 | 303 | 64 | 104 | 64 |
| #FFs | 17083 | – | 8140 | 513870 | 8140 | – | 8140 |
| #LUTs | 28288 | 552K | 11248 | 86282 | 11248 | 49158 | 11248 |
| #DSP | 220 | 1144 | 4 | 808 | 4 | 124 | 4 |
| GOP/s | 77.0 | 291.4 | 37.5 | 493 | 37.27 | 26.73 | 36.58 |
| GOP/s/DSP | 0.35 | 0.254 | 9.38 | 0.061 | 9.32 | 0.22 | 9.15 |
| GOP/s/kLUTs | 2.72 | 0.53 | 3.33 | 5.74 | 3.31 | 0.54 | 3.25 |

This article presents a modified logarithmic quantization to compress convolutional neural networks without a retraining process. We propose a simple but effective method for solving the low-resolution issue of logarithmic quantization at high-level inputs. Different logarithmic-based methods are used for quantizing the weights and the activations. As a result, quantized models achieve accuracy comparable to floating-point models. Additionally, we propose an efficient hardware accelerator for ConNN inference on low-end FPGA devices. Leveraging the advantages of logarithmic compression, the hardware processor computes convolutions using bit shifters and adders. The proposed hardware design is resource-efficient and suitable for embedded devices. For future work, we recommend studying the optimal quantization bit-width for each layer of a ConNN, since each layer has a different distribution.