With the continuous development of deep learning, Deep Convolutional Neural Networks (DCNNs) have attracted wide attention in industry due to their high accuracy in image classification. Compared with other DCNN hardware deployment platforms, the Field Programmable Gate Array (FPGA) offers programmability, low power consumption, parallelism, and low cost. However, the enormous computational load of DCNNs and the limited logic capacity of FPGAs restrict the energy efficiency of DCNN accelerators. The traditional sequential sliding window (SSW) method can improve the throughput of a DCNN accelerator through data multiplexing, but its multiplexing rate is low because it repeatedly reads data between rows. This paper proposes a fast data readout strategy based on a circular sliding window (CSW) data reading method, which improves the multiplexing rate of data between rows by optimizing the memory access order of the input data. In addition, the multiplication bit width of the DCNN accelerator is much smaller than that of a Digital Signal Processing (DSP) block on the FPGA, so resources are wasted if each multiplication occupies a whole DSP. A multiplier sharing strategy is therefore proposed: the accelerator's multipliers are customized so that a single DSP block can complete multiple groups of 4-, 6-, and 8-bit signed multiplications in parallel. Finally, based on these two strategies, an FPGA optimized accelerator is proposed. The accelerator is implemented in the Verilog language and deployed on a Xilinx VCU118. When the accelerator recognizes the CIFAR-10 dataset, its energy efficiency is 39.98 GOPS/W, a 1.73× improvement over previous DCNN FPGA accelerators. When the accelerator recognizes the IMAGENET dataset, its energy efficiency is 41.12 GOPS/W, which is 1.28×–3.14× the energy efficiency of other accelerators.

In recent years, with the rapid development of artificial intelligence, various applications based on DCNN have attracted extensive attention in the industry. DCNN has not only made meaningful progress in the field of computer vision, such as image recognition and detection [

Currently, the mainstream hardware deployment platforms for DCNN include the CPU, ASIC, FPGA, etc. The advantages of the CPU are its mature micro-architecture and advanced semiconductor device technology, but its serial computing mechanism limits the operational speed of DCNN. An ASIC is an integrated circuit designed and manufactured according to user demand. It has the advantages of high computing speed and low energy consumption, but its disadvantages are high design cost and time cost. An FPGA is a field programmable gate array that allows repeated programming and has the advantages of low energy consumption, parallelism, and low cost, so it can be used in parallel computing, cloud computing [

However, the enormous network parameters and computing scale of DCNN, as well as the limited logic capacity of FPGA, make hardware deployment on FPGAs inefficient. In order to improve the operation speed, various methods including data multiplexing [

In order to improve the energy efficiency of DCNN deployed on FPGA, this paper proposes an FPGA optimized accelerator, which focuses on improving operation speed and reducing energy consumption. 1) A circular sliding window (CSW) data reading method is proposed. Compared with other methods, it adopts a ‘down-right-up-right’ circular sliding window mechanism to read the input data. Some input data between rows are multiplexed by this method, so the CSW data reading method takes less time to finish the data reading of a single channel. Moreover, the CSW data reading method keeps output data in adjacent locations, which saves the data rearrangement time before entering the pooling unit. 2) Two types of customized multipliers are introduced to implement multiple groups of signed multiplication in a single DSP. Compared with other multipliers, the customized multipliers in this paper use sign-magnitude representation to encode signed data, so no bit expansion or result correction is required. The customized multipliers can still support the case where both the weight data and the input data are signed.

This chapter mainly introduces the two strategies used by the FPGA optimized accelerator to improve energy efficiency. Firstly, the operation speed of the FPGA-based DCNN accelerator is improved through the fast data readout strategy. Secondly, the multiplier sharing strategy is adopted to reduce the energy consumption of the FPGA.

In order to improve the operation speed of the FPGA-based DCNN accelerator, the research can be carried out from two aspects: reducing the data bandwidth requirement during network operation and reducing the amount of network computation. This paper mainly studies from the perspective of reducing the data bandwidth requirements during network operations.

There are numerous multiplication and accumulation operations in the operation process of DCNN, but, limited by the logical capacity of the FPGA, the DCNN often only performs single-layer partial multiplication and addition operations, so it is necessary to store the intermediate data generated by the operation. Considering that the memory bandwidth of the FPGA is fixed, it is necessary to reduce the data bandwidth requirement of network operation. There are three main methods to reduce the data bandwidth requirement of network operation: data multiplexing, network quantization, and standardized data access patterns. This paper mainly uses data multiplexing and network quantization.

Data multiplexing achieves the purpose of acceleration by reducing the access frequency to memory, which is generally divided into three types [

As shown in

The first advantage, as shown in

The proposed method can be extended to neural networks with different kernel sizes, padding, and strides. If the size of the input channel is N, the amount of padding is P, the stride is S, and the kernel size is K, then, to read each channel of one convolutional layer in parallel, the time costs of the SSW and CSW data reading methods are shown in

The saving ratio of time cost, defined as (T1−T2)/T1, is shown in
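Under the assumption, taken from the description of the two methods, that SSW spends K cycles per output position while the four-state CSW machine spends 2, the time costs T1 and T2 and the saving ratio (T1−T2)/T1 can be sketched in a toy Python model (the helper names are illustrative, not from the paper):

```python
def out_positions(n, k=3, p=0, s=1):
    # Number of window positions over an N x N channel with padding P, stride S.
    out = (n + 2 * p - k) // s + 1
    return out * out

def ssw_cycles(n, k=3, p=0, s=1):
    return out_positions(n, k, p, s) * k   # assumed: K cycles per window (rows re-read)

def csw_cycles(n, k=3, p=0, s=1):
    return out_positions(n, k, p, s) * 2   # assumed: 2 cycles per window (rows multiplexed)

def saving_ratio(n, k=3):
    t1, t2 = ssw_cycles(n, k), csw_cycles(n, k)
    return (t1 - t2) / t1                  # (T1 - T2) / T1; equals 1/3 for K = 3
```

For K = 3 this toy model gives a saving ratio of 1/3, consistent with the roughly 33% reduction in data reading time reported later in the paper.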

The second advantage, as shown in

The stride length and kernel size of the pooling layer could affect the size of the cache space.

k_size | Stride | Cached rows (CSW) | Cached rows (SSW) | Cache saved
---|---|---|---|---
2 | 2 | 2 | 2 | √
2 | 2 | 2 | × | 
2 | 2 | 2 | 2 | ×
2 | 2 | 2 | √ | 

The fast data readout strategy mainly improves the speed of input data reading, so it is applied to the INPUT CU unit of the FPGA optimized accelerator.

In order to reduce the energy consumption of the FPGA, this paper mainly studies from the perspective of reducing the resource consumption of the FPGA. Because the core operations of the convolutional layer and the fully connected layer are both multiplication and addition, this paper multiplexes the same operation array between the fully connected layer and the convolutional layer to improve the resource utilization of the FPGA. In addition, since the DCNN is quantized, the data width of each multiplication is 8 bits or less, while one DSP on the FPGA supports a 27-bit × 18-bit multiplication. Therefore, a multiplier sharing strategy can be applied: the multipliers in the operation array are customized so that each DSP can simultaneously complete two or three groups of multiplication, reducing the resource consumption of the FPGA.

In this paper, sign-magnitude representation is used to encode signed data. For example, −4 is expressed as 1_00100, and 4 is expressed as 0_00100. The ordinary DSP block of an off-the-shelf FPGA device is modified into a Signed-Signed Three Multiplication (SSTM) unit or a Signed-Signed Double Multiplication (SSDM) unit. The SSTM architecture allows one DSP to complete the multiplication of three groups of 6-bit signed data at the same time, but the three groups of multiplication must share the same input data or the same weight data. The SSDM architecture allows one DSP to complete two groups of 6-bit signed multiplication at the same time, and the input data and weight data of the two groups can be different. In both proposed structures, the sign bit and the magnitude bits are processed separately: the sign bit is handled by XOR gates, and the magnitude bits are handled by the ordinary DSP. This paper mainly introduces the structures of SSTM and SSDM for a data width of 6, but SSTM and SSDM can also be applied to other data widths.

As shown in

The sign bits of the input data (in [5]) and the sign bits of the three groups of weight data (w1[5], w2[5], w3[5]) are extracted for XOR operation, respectively. The sign bits of three groups of multiplication output data can be obtained.
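The SSTM idea can be illustrated with a small bit-level model for the shared-input case (a minimal Python sketch, not the accelerator's Verilog; the 10-bit lane layout is an assumption): three 5-bit weight magnitudes are packed into separate lanes of one wide operand, so a single wide multiplication yields all three magnitude products, while the sign bits are combined by XOR exactly as described above.

```python
def sstm(in_val, w1, w2, w3):
    # All operands are 6-bit signed values in [-31, 31].
    def sm(v):
        # Split into (sign, 5-bit magnitude), e.g. -4 -> (1, 0b00100).
        return (1 if v < 0 else 0), abs(v)

    s_in, m_in = sm(in_val)
    signs, mags = zip(*(sm(w) for w in (w1, w2, w3)))

    # Pack the three weight magnitudes into 10-bit lanes of one wide operand:
    # each 5x5-bit magnitude product fits in 10 bits, so the lanes never overlap.
    packed = sum(m << (10 * i) for i, m in enumerate(mags))

    prod = m_in * packed          # the single wide (DSP-like) multiplication

    outs = []
    for i in range(3):
        mag = (prod >> (10 * i)) & 0x3FF    # extract one 10-bit product lane
        sign = s_in ^ signs[i]              # sign bits handled separately by XOR
        outs.append(-mag if sign else mag)
    return outs
```

Because the sign is never part of the packed operand, no bit expansion or result correction is needed, which is the point of the sign-magnitude encoding.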

Considering that the number of convolutional kernels in networks is always an integer multiple of 4, if the accelerator only used the SSTM structure, some DSPs would be idle during the calculation of each convolutional layer, which would reduce DSP efficiency; therefore, the accelerator also adopts some SSDM structures, which support different input data.

As shown in

The sign bits of the two groups of input data (in1[5], in2[5]) and the sign bits of the two groups of weight data (w1[5], w2[5]) are extracted for XOR operation, respectively. The sign bits of the two groups of multiplication output data can thus be obtained.

Besides, the SSTM and SSDM structures can also be applied to data widths other than 6. As shown in

Bit | Input | Weight | Out | LUT
---|---|---|---|---
4 | 1 | 4 | 4 | 2
6 | 1 | 3 | 3 | 2
8 | 1 | 2 | 2 | 1

Bit | New input | New weight | Out | LUT
---|---|---|---|---
4 | 15 | 9 | 24 | 1
5 | 20 | 12 | 32 | 1
6 | 25 | 15 | 40 | 1

Overall, the SSTM and SSDM structures can be used in other quantized networks. SSDM is limited by the quantization precision, but with the development of lightweight networks, more complicated networks can be quantized to lower data widths, [

This chapter mainly introduces the hardware circuit structure of the FPGA optimized accelerator. As shown in

The data path of the accelerator is shown in

When the parameters of DCNN are quantized, the bit widths of the input data and weight data of each layer of DCNN are the same. Therefore, the proposed architecture only completes the design of the DCNN one-layer circuit structure and then continuously multiplexes the entire circuit until all layers of DCNN are calculated.

The memory cell consists of three parts: DDR4, the RAM group, and ROM. DDR4 mainly stores the weight data, because the amount of DCNN weight data is so large that on-chip storage resources cannot meet its needs. In addition, considering that the DDR4 read/write clock is not synchronous with the accelerator's master clock, a FIFO unit is added between DDR4 and the control unit for buffering. The RAM group mainly stores the variable data of the DCNN operation process, i.e., the input data and output data of a single DCNN layer. It is divided into two groups, with 64 RAMs in each group. Since the architecture proposed in this paper computes layer by layer, while one RAM group serves as the input side and provides input data for the control unit, the other RAM group serves as the output side and saves the output data. ROM mainly stores the invariant data of the DCNN operation, i.e., the bias values and quantization parameters of the DCNN.
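The ping-pong role swap of the two RAM groups can be sketched as follows (a hypothetical Python model; `RamGroups`, `read_sel`, and `write_sel` are illustrative names, not from the paper):

```python
class RamGroups:
    """Two groups of 64 RAMs whose input/output roles swap every layer."""

    def __init__(self):
        self.groups = [[[] for _ in range(64)], [[] for _ in range(64)]]
        self.read_sel = 0                  # group currently feeding the control unit

    @property
    def write_sel(self):                   # the other group collects layer outputs
        return self.read_sel ^ 1

    def next_layer(self):                  # last layer's outputs become new inputs
        self.read_sel ^= 1
```

Because one group always reads while the other writes, the single-layer computation never stalls on its own output traffic.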

The control unit consists of six parts: TOP CU, INPUT CU, RAM SEL, WEIGHT CU, bias data and quantization parameter control unit (B_Q CU) and OUTPUT CU.

The TOP CU unit mainly controls the data flow direction of the entire accelerator and provides hyper-parameters for the other control units. Its control logic is shown in Algorithm 1. Since the operation unit can only calculate the multiplication and accumulation of 64 groups of input channels and 4 groups of convolution kernels in parallel, the whole control logic is divided into three loops. The first loop completes the operation of all input channels corresponding to four groups of convolution kernels in the convolutional layer, or the operation of 512 input neurons corresponding to 4 output neurons in the fully connected layer. The operation process includes multiplication, quantization, and activation. The second loop completes the operation of all convolution kernels in a single convolutional layer, or the operation of all output neurons in a single fully connected layer. The third loop completes the operation of all layers of the DCNN, and the corresponding pooling operation is completed according to the layer number.
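The three-loop nesting can be sketched as a pass scheduler (a hypothetical pure-Python model; the real operation unit performs one 64-channel × 4-kernel multiply-accumulate per pass):

```python
def schedule(num_layers, in_ch_per_layer, kernels_per_layer):
    """List every (layer, kernel_group, channel_group) pass in Algorithm 1 order."""
    passes = []
    for layer in range(num_layers):                            # loop 3: all layers
        for kgrp in range(kernels_per_layer[layer] // 4):      # loop 2: 4 kernels/neurons at a time
            for cgrp in range(in_ch_per_layer[layer] // 64):   # loop 1: 64 channels/neurons at a time
                passes.append((layer, kgrp, cgrp))
    return passes
```

For example, a layer with 128 input channels and 128 kernels needs (128 ÷ 4) × (128 ÷ 64) = 64 passes through the operation unit.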

INPUT CU unit mainly reads the input data from the memory cell to the calculation unit. As shown in

The CSW data reading method contains four states. Each state updates the 9 input data stored in registers, and the 9 input data are transmitted to the calculation unit in 2 cycles. In the first state, the input data at positions ‘1’, ‘4’, ‘5’, ‘6’ and ‘0’ are sent to the PE array in the first clock cycle, and the input data at positions ‘2’, ‘6’, ‘9’, ‘6’ and ‘0’ are sent to the PE array in the second clock cycle; the window then slides ‘down’ into the second state. In the second state, the input data at positions ‘5’, ‘8’, ‘9’, ‘12’ and ‘4’ are sent to the PE array in the first clock cycle, and the input data at positions ‘6’, ‘10’, ‘13’, ‘14’ and ‘4’ are sent to the PE array in the second clock cycle; the window then slides ‘right’ into the third state. In the third state, the input data at positions ‘6’, ‘9’, ‘10’, ‘13’ and ‘5’ are sent to the PE array in the first clock cycle, and the input data at positions ‘7’, ‘11’, ‘14’, ‘15’ and ‘5’ are sent to the PE array in the second clock cycle; the window then slides ‘up’ into the fourth state. In the fourth state, the input data at positions ‘2’, ‘5’, ‘6’, ‘9’ and ‘1’ are sent to the PE array in the first clock cycle, and the input data at positions ‘3’, ‘7’, ‘10’, ‘11’ and ‘1’ are sent to the PE array in the second clock cycle; the window then slides ‘right’ back into the first state.

The CSW method makes the address control logic of the input data more complicated. However, the structure reads 64 input data at the same position in different channels, so once the address control logic of one input channel is implemented, the other channels reuse the same addresses. As a result, the more complicated control logic causes only a slight increase in logic resources and has little effect on the performance of the entire structure.
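For stride 1 and an even number of output rows, the ‘down-right-up-right’ traversal of window positions can be modeled as follows (a hypothetical sketch; `csw_order` is an illustrative name):

```python
def csw_order(n_out_rows, n_out_cols):
    # 'down-right-up-right' traversal of 3x3 window positions, two output rows
    # at a time (assumes stride 1 and an even number of output rows).
    order = []
    for r in range(0, n_out_rows, 2):
        for c in range(n_out_cols):
            pair = [(r, c), (r + 1, c)]          # even column: top then bottom
            order += pair if c % 2 == 0 else pair[::-1]  # odd column: bottom then top
    return order
```

Consecutive positions in this order are vertically or horizontally adjacent, which is why the outputs arrive at the pooling unit already grouped and need no rearrangement.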

The OUTPUT CU unit mainly processes the output data of the calculation unit and saves it to the storage unit. As shown in

For the Q/A unit, the quantization and activation operations on the accumulated results are completed. Considering that the quantization parameters are floating-point, in order to avoid floating-point multiplication, the quantization and activation functions are fused in this paper. The fusion principle is as follows: the data width of the input data is x bits, the data width of the output data is 5 bits, and the value range is [0, 31]. As shown in
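The paper's exact fusion equations appear in its figures; one common way to realize a float-free fused quantize-and-activate with a 5-bit output, sketched here as an assumption rather than the paper's exact method, is to precompute integer thresholds offline and replace the runtime multiplication with comparisons:

```python
import bisect
import math

def make_fused_qa(scale):
    # Precompute, offline, the 31 integer thresholds: thresholds[n-1] is the
    # smallest accumulator value that quantizes to n. At run time the
    # floating-point multiply is replaced by threshold comparisons, and the
    # result is clamped to the 5-bit range [0, 31] (negative inputs map to 0,
    # which also implements the ReLU activation).
    thresholds = [math.ceil((n - 0.5) / scale) for n in range(1, 32)]

    def qa(acc):
        return bisect.bisect_right(thresholds, acc)  # number of thresholds <= acc

    return qa
```

With `scale = 0.1`, an accumulator value of 37 quantizes to round(3.7) = 4, obtained purely by integer comparisons.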

As shown in

For the pooling unit, the accelerator uses the max pooling operation. Considering that the four groups of data of the pooling operation come in different clock cycles, a comparator is used to complete four comparisons in four clocks. For every four comparisons, the non-input end of the comparator is set to 0. The structure is shown in
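The single-comparator schedule can be sketched as follows (a hypothetical model; it assumes, as holds after ReLU, that the data are non-negative, so resetting the register to 0 between windows is safe):

```python
def stream_maxpool(values):
    # One comparator, four comparisons per 2x2 window; the running register is
    # reset to 0 after every 4th value (safe because post-ReLU data are >= 0).
    out, reg = [], 0
    for i, v in enumerate(values, start=1):
        reg = max(reg, v)
        if i % 4 == 0:
            out.append(reg)
            reg = 0
    return out
```

This is where the CSW ordering pays off: the four values of each pooling window arrive consecutively, so a single comparator suffices.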

For the SOFTMAX unit, as shown in

The RAM SEL unit provides input data for the INPUT CU unit to select the corresponding RAM according to the number of layers. The RAM SEL unit selects the corresponding RAM for the OUTPUT CU unit to save the single-layer operation results. WEIGHT CU unit reads the weight data from DDR4 to the FIFO unit and then transmits the weight in the FIFO to the calculation unit. Since the read-and-write clock frequency of DDR4 is faster than the clock frequency of the accelerator, the PING-PONG mechanism is used to read the weight data. B_Q CU unit reads the bias data and quantization parameters required for the operation from the ROM to the OUTPUT CU unit.

The calculation unit is the core unit of the accelerator, completing the multiplication of the convolutional layer and the fully connected layer. As shown in

The input data of the same column in the first row and the second row is the same, and the weight data of each operation unit is different. In addition, the PE array supports two working modes, which are used for the operation of the convolution layer and the fully connected layer, respectively.

In the working mode of the convolution layer: 64 PE3 units and 96 PE2 units are all in a working state, the PE array can complete the multiplication of 64 input channels and 4 convolutional kernels in 2 clock cycles. Because each convolutional kernel has 9 groups of weight data, a single clock cycle of the PE array can complete 1152 (64 × 4 × 9 ÷ 2) groups of multiplication in parallel. The data path is shown in

In the working mode of the fully connected layer: 64 PE3 units in the first row and 64 PE2 units in the second row are in a working state, but 32 PE2 units in the third row are not working. The PE array can complete the multiplication of 512 input neurons and 4 output neurons in 2 clock cycles. A single clock cycle of the PE array can complete 1024 (512 × 4 ÷ 2) groups of multiplication in parallel.

This chapter compares and summarizes the performance parameters of the FPGA optimized accelerator architecture with other published work. Firstly, the network structure used to evaluate and how to complete network quantization is introduced. Secondly, the speed improvement brought by the fast data readout strategy is analyzed and summarized. Thirdly, the energy saving of the multiplier sharing strategy is analyzed. Finally, the overall energy efficiency of the accelerator is evaluated.

The datasets should be publicly available to ensure the experiments are reproducible, so the IMAGENET dataset and the CIFAR-10 dataset are selected to be recognized by the FPGA optimized accelerator. In addition, the absolute time saving (T_absolute) brought by the fast data readout strategy can also be affected by the image size, so the two datasets of different sizes can further be used to verify the relationship between T_absolute and the image size; the corresponding result is shown in

VGG16 is composed of thirteen convolutional layers and three fully connected layers. Each convolutional layer is followed by a bn (batch normalization) layer and an activation function. The 13 convolutional layers are divided into five groups, and after each group a pooling operation is performed to reduce the feature image size. For the CIFAR-10 dataset, the image size is 32 × 32 × 3. For the IMAGENET dataset [

LAYER | CIFAR-10 WEIGHT | CIFAR-10 INPUT | IMAGENET WEIGHT | IMAGENET INPUT
---|---|---|---|---
1–2 | 3 × 3 × 64, 64 | 32 × 32 | 3 × 3 × 64, 64 | 224 × 224
3–4 | 3 × 3 × 128, 128 | 16 × 16 | 3 × 3 × 128, 128 | 112 × 112
5–7 | 3 × 3 × 256, 256 | 8 × 8 | 3 × 3 × 256, 256 | 56 × 56
8–10 | 3 × 3 × 512, 512 | 4 × 4 | 3 × 3 × 512, 512 | 28 × 28
11–13 | 3 × 3 × 512, 512 | 2 × 2 | 3 × 3 × 512, 512 | 14 × 14
14 | 512 × 512 | 1 | 25088 × 4096 | 1
15 | 512 × 512 | 1 | 4096 × 4096 | 1
16 | 512 × 10 | 1 | 4096 × 1000 | 1

According to whether the quantized network needs retraining, the quantization method is divided into PTQ [

As shown in, first, the weight w and bias b are fused with the parameters μ_y (mean) and δ_y² (variance) of the bn (batch normalization) layer to obtain the corrected parameters w′ (corrected weight) and b′ (corrected bias). The bn layer has thus been integrated into the weights, so there is no need to deploy the bn layer on the FPGA. Second, the parameters x (input data), w′, and b′ are quantized, respectively. The quantization equations are shown in. The quantized input data, bias data, and weight data are all integers and are used for the convolution operation. The result (
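Standard batch-normalization folding gives one concrete form of this fusion (a sketch under the usual BN formulation, which also has a scale γ and shift β in addition to the mean and variance mentioned above; the paper's exact equations are in its figures):

```python
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # Fold y = gamma * ((x*w + b) - mean) / sqrt(var + eps) + beta
    # into the equivalent y = x*w' + b', so no bn layer is needed at inference.
    s = gamma / math.sqrt(var + eps)
    return w * s, (b - mean) * s + beta   # (w', b')
```

Because the folded form is an affine function of x with the same shape as the original convolution, the FPGA datapath is unchanged; only the stored weights and biases differ.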

Using the PTQ algorithm, the accuracy of VGG16 in recognizing CIFAR-10 under different data widths is obtained, as shown in

When the accelerator recognizes the CIFAR-10 dataset, the system clock frequency is 200 MHz and the latency for inferring one image is 2.02 ms, so the throughput of the accelerator is 310.62 GOPS. When the accelerator recognizes the IMAGENET dataset, the system clock frequency is 150 MHz and the latency for inferring one image is 98.77 ms; the corresponding throughput is 313.35 GOPS.

Layer | Epoch | Size (CIFAR-10) | Delay (us) | Size (IMAGENET) | Delay (ms)
---|---|---|---|---|---
1 | 1 | 32 * 32 | 5.43 | 224 * 224 | 0.34
2 | 16 | 32 * 32 | 87.04 | 224 * 224 | 5.40
3 | 32 | 16 * 16 | 46.08 | 112 * 112 | 2.72
4 | 64 | 16 * 16 | 92.16 | 112 * 112 | 5.45
5 | 128 | 8 * 8 | 51.2 | 56 * 56 | 2.77
6 | 256 | 8 * 8 | 102.4 | 56 * 56 | 5.54
7 | 256 | 8 * 8 | 102.4 | 56 * 56 | 5.54
8 | 512 | 4 * 4 | 61.44 | 12 * 12 | 2.87
9 | 1024 | 4 * 4 | 122.88 | 12 * 12 | 5.73
10 | 1024 | 4 * 4 | 122.88 | 12 * 12 | 5.73
11 | 1024 | 2 * 2 | 40.96 | 12 * 12 | 1.53
12 | 1024 | 2 * 2 | 40.96 | 12 * 12 | 1.53
13 | 1024 | 2 * 2 | 40.96 | 12 * 12 | 1.53

 | FF | LUT | DSP | BRAM (36 Kb)
---|---|---|---|---
Available | 62589 | 82576 | 451 | 86.5
Utilization | 2.65% | 6.98% | 6.59% | 4.00%

 | FF | LUT | DSP | BRAM (36 Kb)
---|---|---|---|---
Available | 63835 | 83934 | 451 | 983
Utilization | 2.70% | 7.10% | 6.59% | 45.51%

The energy consumption of various hardware resources when deploying the accelerator on an FPGA is shown in

To reduce the energy consumption of logical resources and ultimately decrease the overall energy consumption of the FPGA optimized accelerator, the accelerator primarily improves the utilization of DSPs. For the multiplier, the accelerator has been custom-designed and introduces the SSTM and SSDM architectures, enabling each DSP to perform multiplication in parallel for 2–3 groups of signed data. The calculation unit of the accelerator consists of 256 SSTM and 192 SSDM, capable of parallel calculation for a total of 1152 groups of multiplication. According to the data in

 | | Num | DSP | Power
---|---|---|---|---
Single multiplication | | 1152 | 1155 | 1.687 W
Multiple multiplication | SSTM | 256 | 451 | 0.657 W
 | SSDM | 192 | | 

The SSTM and SSDM architectures proposed in this article have some unique features compared to other architectures that implement multiple multiplications in a single DSP. As for the SSTM structure, as can be seen from

[ |
[ |
SSTM | ||||||
---|---|---|---|---|---|---|---|---|

Input | Bit | 8 | 4 | 6 | 8 | 4 | 6 | 8 |

Type | Unsigned | Signed | Signed | |||||

Num | 1 | 1 | 1 | |||||

Weight | Bit | 8 | 4 | 6 | 8 | 4 | 6 | 8 |

Type | Signed | Unsigned | Signed | |||||

Num | 2 | 6 | 4 | 3 | 4 | 3 | 2 | |

Output | Num | 2 | 6 | 4 | 3 | 4 | 3 | 2 |

Loss | No | No | Yes | No | ||||

LUT | 0 | 16 | 38 | 57 | 2 | 2 | 1 |

Bit | Input | Weight | Output | LUT | |||
---|---|---|---|---|---|---|---|

SSDM | 4 | Signed | 2 | Signed | 2 | 2 | 1 |

5 | 2 | 1 | |||||

6 | 2 | 1 | |||||

[ |
4 | Unsigned | 2 | Signed | 2 | 2 | 0 |

VGG16 is chosen as the DCNN model for the accelerator to recognize the CIFAR-10 dataset and the IMAGENET dataset. This paper implements the various parts of the FPGA optimized accelerator in the Verilog language and deploys it on the Xilinx VCU118. The recognition of the CIFAR-10 dataset and the IMAGENET dataset is evaluated using energy efficiency (GOPS/W) and DSP efficiency (GOPS/DSP) as metrics.

For CIFAR-10 dataset recognition, as shown in

 | Frequency (MHz) | BRAM (36 Kb) | DSP | Delay (ms) | Throughput (GOPS) | Power (W) | DSP efficiency (GOPS/DSP) | Energy efficiency (GOPS/W)
---|---|---|---|---|---|---|---|---
[ | 100 | 447 | 471 | 3.38 | 188.41 | 8.15 | 0.40 | 23.11
This work | 100 | 86.5 | 451 | 4.04 | 155.31 | 5.72 | 0.34 | 27.15
 | 200 | 86.5 | 451 | 2.02 | 310.62 | 7.77 | 0.69 | 39.98

For IMAGENET dataset recognition, as shown in

 | [ | [ | [ | [ | [ | [ | [ | This work | 
---|---|---|---|---|---|---|---|---|---

Platform | VCU118 | ZCU102 | VX6907 | XCZ7020 | VU9P | ZCU706 | ZCU102 | VCU118 | |

Frequency (MHz) | 150 | 200 | 150 | 214 | 125 | 166 | 200 | 150 | 200 |

BRAM (36 Kb) | 1779 | 1460 | 2365 | 85.5 | 1732 | 652 | 912 | 983 | 983 |

DSP | 4096 | 1352 | 2688 | 190 | 5349 | 793 | 1144 | 451 | 451 |

Throughput (GOPS) | 2558.3 | 495.4 | 829.84 | 84.3 | 1068.37 | 167.58 | 309 | 313.35 | 417.81 |

Power (W) | N/A | 15.4 | 31.2 | 3.5 | 48.62 | 6.08 | 23.6 | 8.23 | 10.16 |

DSP efficiency (GOPS/DSP) | 0.62 | 0.37 | 0.31 | 0.44 | 0.2 | 0.21 | 0.27 | 0.69 | 0.92 |

Energy efficiency (GOPS/W) | N/A | 32.17 | 26.6 | 24.09 | 21.97 | 27.56 | 13.09 | 38.06 | 41.12 |

With more BRAM and DSP resources, other accelerators may achieve higher throughput than the FPGA optimized accelerator. However, it is worth noting that each DSP unit of the FPGA optimized accelerator performs an average of 2.55 multiplications. Additionally, by adopting the CSW data reading method, the FPGA optimized accelerator saves approximately 33% of the data reading time. These features make the accelerator excel in terms of energy efficiency.

In this paper, the FPGA optimized accelerator architecture is proposed. To improve the energy efficiency of the accelerator, the fast data readout strategy is used: compared with the SSW data reading method, it reduces the inference delay by 33%, and the delay saving of an individual channel is positively correlated with the image size. Moreover, the multiplier sharing strategy adopted by the accelerator saves 61% of the accelerator's DSP resources, which leads to a decline in energy consumption. Finally, the DSP efficiency and energy efficiency of the accelerator are evaluated. When the system clock of the accelerator is set to 200 MHz, for the CIFAR-10 dataset, the DSP efficiency is 0.69 GOPS/DSP, 1.73× that of previous FPGA accelerators, and the energy efficiency is 39.98 GOPS/W, 1.73× that of others. For the IMAGENET dataset, the DSP efficiency is 0.92 GOPS/DSP, 1.48×–4.6× that of previous FPGA accelerators, and the energy efficiency is 41.12 GOPS/W, 1.28×–3.14× that of others.

VGG16 is selected as the DCNN model for testing in this paper, but the fast data readout strategy and the multiplier sharing strategy proposed here are also suitable for other neural networks, such as ResNet and MobileNet; applying them is the focus of future work.

The authors are very grateful to the National University of Defense Technology for providing the experimental platforms.

This work was supported in part by the Major Program of the Ministry of Science and Technology of China under Grant 2019YFB2205102, and in part by the National Natural Science Foundation of China under Grant 61974164, 62074166, 61804181, 62004219, 62004220, and 62104256.

T. Ma and Z. Li contributed equally to this work; the corresponding author is Q. Li. The authors confirm their contribution to the paper as follows: study conception and design: T. Ma, Z. Li and Q. Li; data collection: T. Ma and Z. Zhao; analysis and interpretation of results: Z. Li, H. Liu and Y. Wang; draft manuscript preparation: T. Ma and Z. Li. All authors reviewed the results and approved the final version of the manuscript.

The data will be made publicly available on GitHub after Tuo Ma completes the degree.

The authors declare that they have no conflicts of interest to report regarding the present study.
