Convolutional Neural Networks (CNNs) are widely used in many fields. Owing to their high computational intensity and throughput demands, however, an increasing number of researchers are focusing on how to improve the computational efficiency, hardware utilization, or flexibility of CNN hardware accelerators. Accordingly, this paper proposes a dynamically reconfigurable accelerator architecture that implements a Sparse-Winograd F(2 × 2, 3 × 3)-based high-parallelism hardware architecture. This approach not only eliminates the pre-calculation complexity associated with the Winograd algorithm, thereby reducing the difficulty of hardware implementation, but also greatly improves the flexibility of the hardware; as a result, the accelerator can realize Conventional Convolution, Grouped Convolution (GCONV), or Depthwise Separable Convolution (DSC) with the same hardware architecture. Our experimental results show that the accelerator achieves a 3x–4.14x speedup on VGG-16 and MobileNet V1 compared with designs that do not use the acceleration algorithm. Moreover, compared with previous designs using the traditional Winograd algorithm, our design achieves a 1.4x–1.8x speedup. At the same time, the efficiency of the multiplier improves by up to 142%.

At present, CNNs are widely used in various fields impacting different areas of people’s lives, such as image classification [

Many solutions have been proposed to solve the associated problems. These include using Graphics Processing Units (GPUs) [

As face recognition, motion tracking, and other applications have become widely used in both mobile and embedded devices, the demand for CNNs in these devices is also increasing. Because embedded and mobile devices have more limited hardware resources and stricter power constraints, lightweight CNNs such as MobileNet [

This paper proposes a fast decomposition method based on the Sparse-Winograd algorithm (FDWA); this approach extends the applicability of the Winograd algorithm to various convolution shapes and further reduces the computational complexity. Based on a unified hardware implementation, it provides a high level of configurability, enabling fast switching between configurations for different convolution shapes at runtime. Generally speaking, the main advantages of this architecture are as follows:

The FDWA not only reduces the number of multiplications and shortens the convolution loop iterations, but also introduces dynamic sparsity into the network; this reduces the hardware workload, improves the speed of convolution execution, expands the scope of application of the Sparse-Winograd algorithm, and simplifies the hardware implementation. Moreover, compared with conventional accelerator designs, the proposed hardware architecture also enhances the adaptability and flexibility of the accelerator, meaning that it can accelerate both Conventional Convolutions and the newer convolution types (such as DSC or GCONV) without any change to the hardware architecture.

This paper proposes a high-throughput 2D computing architecture with reconfigurable I/O parallelism, which improves PE Array utilization. A linear buffer storage structure based on double buffering is used to reduce the data-movement overhead between all levels of storage and to hide the data-movement time, so that the entire system runs at the peak computing speed of the PE Array.

The remainder of this paper is organized as follows. Section 2 summarizes related work. Section 3 briefly introduces the Winograd algorithm and describes the operation of the FDWA in detail. Section 4 details the architecture of the accelerator, while Section 5 provides the implementation results and comparison. Finally, Section 6 presents the conclusion.

Previous studies have proposed a wide range of CNNs accelerator architectures. The DianNao [

In this article, in order to speed up operation, save hardware resources, and improve calculation efficiency, we designed the FDWA as the basis of a hardware design built on the Sparse-Winograd algorithm. Moreover, we test the algorithm's acceleration effect and hardware resource consumption via SystemC [

Firstly, it is feasible in practical terms to accelerate CNNs at the algorithm level. Various advanced convolution algorithms have therefore been applied to accelerator design. Taking Winograd, FFT, and FFA algorithms as examples, these approaches reduce the number of multiplication operations by reducing the computational complexity, thereby further saving on resources. Compared with the FFT and FFA algorithms, the Winograd algorithm can achieve better results when the convolution kernel is smaller [

Secondly, CNN architectures have been updated at an accelerating pace of late, and the convolutional layers within a CNN usually differ greatly from one another, which places higher requirements on the flexibility of the accelerator. However, most studies based on algorithm acceleration typically support only a specific, fixed convolution type. At the same time, lightweight CNNs such as MobileNet and ShuffleNet, which reduce both the number of parameters and the computational complexity, have seen widespread use on mobile devices and embedded platforms; one example is MobileNetV1 [

At present, there have been few studies on hardware accelerators for lightweight CNNs. While Bai et al. [

CNNs are primarily composed of convolutional layers, Rectified Linear Units (ReLUs), pooling layers, fully connected layers, etc. Of these components, the convolutional layers contribute most to CNN operation; however, they also take up most of the calculation time.

Following the emergence of lightweight CNNs, new convolution methods have been widely adopted due to their outstanding advantages. DSC was first proposed in MobileNet V1, while GCONV first appeared in AlexNet. As shown in

Model | ImageNet accuracy | Million Mult-Adds | Million parameters
---|---|---|---
MobileNet-224 | 70.6% | 569 | 4.2

Model | Input feature | Filter | Parameters | Calculation amount
---|---|---|---|---
Conventional Convolution | | | |
GCONV | | | |
DSC | | | |

The Winograd algorithm is a fast algorithm for CNNs that reduces the number of multiplication operations by transforming the input feature map and executing a series of transformations on the filter. Taking the one-dimensional Winograd algorithm as an example, the specific operation process can be expressed as follows:
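To make the transform concrete, the following sketch implements the standard 1D F(2, 3) Winograd algorithm. The B^T, G, and A^T matrices follow the commonly used Lavin–Gray formulation and are quoted here for illustration; they are an assumption, not taken from this paper's elided equations.

```python
import numpy as np

# Standard transform matrices for F(2, 3): two outputs of a 3-tap
# filter over a 4-sample input tile, using 4 multiplies instead of 6.
BT = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """y = A^T [(G g) .* (B^T d)] for a length-4 tile d and 3-tap filter g."""
    U = G @ g          # filter transform (precomputable offline)
    V = BT @ d         # input transform
    return AT @ (U * V)  # 4 element-wise multiplies, then output transform

d = np.array([1., 2., 3., 4.])
g = np.array([1., 0., -1.])
y = winograd_f23(d, g)                  # 4 multiplies
ref = np.convolve(d, g[::-1], 'valid')  # direct cross-correlation, 6 multiplies
assert np.allclose(y, ref)
```

The filter transform `G @ g` depends only on the weights, so in a hardware implementation it can be computed once offline and reused across all input tiles.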

In

Similarly, the operation process of the 2D Winograd algorithm

It can be concluded from
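As an illustration of the nested 2D case, the sketch below computes one F(2 × 2, 3 × 3) output tile using 16 element-wise multiplies instead of the 36 a direct computation needs, and checks the result against direct correlation. The matrix values are the standard ones and are assumed here rather than quoted from this paper.

```python
import numpy as np

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def winograd_f2x2_3x3(d, g):
    """Y = A^T [(G g G^T) .* (B^T d B)] A: a 2x2 output tile from a
    4x4 input tile and a 3x3 filter, with 16 multiplies instead of 36."""
    U = G @ g @ G.T      # 4x4 transformed filter (precomputable offline)
    V = BT @ d @ BT.T    # 4x4 transformed input tile
    return AT @ (U * V) @ AT.T

def direct_2d(d, g):
    """Reference: direct 'valid' cross-correlation over the same tile."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out

d = np.arange(16, dtype=float).reshape(4, 4)
g = np.array([[1., 0., -1.], [0., 1., 0.], [-1., 0., 1.]])
assert np.allclose(winograd_f2x2_3x3(d, g), direct_2d(d, g))
```

Adjacent 4 × 4 input tiles overlap by two rows/columns, so a full feature map is processed by sliding this tile with stride 2 over the input.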

By conducting further research into the Winograd algorithm, Liu et al. [

By moving ReLU and Prune into the Winograd domain, the number of multiplication operations can be reduced and the overall efficiency further improved when the Winograd algorithm is used for acceleration; in this way, the acceleration of a sparse network based on the Winograd algorithm can be realized.
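A minimal sketch of this idea follows. It is illustrative only: in practice the network is retrained with ReLU and pruning placed in the Winograd domain (as in Liu et al.), and the threshold value here is a hypothetical parameter.

```python
import numpy as np

BT = np.array([[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]], float)
G  = np.array([[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]], float)
AT = np.array([[1, 1, 1, 0], [0, 1, -1, -1]], float)

def sparse_winograd_tile(d, g, threshold=0.0):
    """Apply ReLU and weight pruning in the Winograd domain, then
    multiply only the operand pairs where both entries are non-zero."""
    U = G @ g @ G.T
    V = np.maximum(BT @ d @ BT.T, 0.0)          # ReLU in the Winograd domain
    U = np.where(np.abs(U) > threshold, U, 0.0) # prune small transformed weights
    mask = (U != 0) & (V != 0)                  # multiplies that must actually run
    M = np.zeros_like(U)
    M[mask] = U[mask] * V[mask]
    return AT @ M @ AT.T, int(mask.sum())       # output tile, multiplies used

tile = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.full((3, 3), 1.0 / 9)               # a 3x3 averaging filter
y, used = sparse_winograd_tile(tile, kernel, threshold=0.05)
assert used <= 16                               # never more than the dense count
```

The `mask` is the hardware-relevant quantity: every zero in it is a multiplication the PE array can skip, which is where the dynamic-sparsity speedup comes from.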

Although convolution operations based on the Sparse-Winograd algorithm can reduce the number of required multiplications and improve operational efficiency, different types of convolution follow essentially the same process as the Winograd algorithm while requiring more complex pre-operations, chiefly the calculation of the conversion matrices A, B, and G. This calculation is difficult to realize in hardware; in addition, as the input feature map or convolution kernel grows, the range of the parameter values in the conversion matrices becomes very large, which presents challenges for the design of hardware, storage, bandwidth, and power consumption. These problems restrict the development of CNN accelerators based on the Sparse-Winograd algorithm.

In order to solve the above problems, we propose the FDWA method based on the Sparse-Winograd algorithm, which operates by decomposing the input feature map and filter on the basis of the Sparse-Winograd

As shown in

In this section, the accelerator scheme is comprehensively introduced.

1) Control module: This component is responsible for receiving the task information sent by the CPU. When the task is loaded onto the accelerator and calculation begins, the module distributes the task to the calculation module and prompts the DMA module to either move data into the buffers or write to the external DDR through the interconnection configuration. After calculation is complete, the module moves the result to the external DRAM and notifies the CPU.

2) Data read (write) and conversion module: The input feature tile data is read from the buffer, after which the input is transformed into the Winograd domain by the input conversion module via the implementation of

3) Data calculation module: After conversion to the Winograd domain, the data enters the operation module, which completes the dot-product operation, after which the data flows to the accumulation module. During the accumulation operation, configuration information from the controller is received, and GCONV and DSC are completed by separating the channels. In the Sub-Channel operation of DSC, the data from all input channels are converted directly and then saved to the on-chip buffer. For GCONV, moreover, the input channels are divided into several groups, after which the data in each group is accumulated and stored in the on-chip buffer following conversion.
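The channel-separated accumulation can be sketched as follows. Here `groups` is a hypothetical configuration parameter standing in for the controller information: groups = 1 reproduces Conventional Convolution, an intermediate value gives GCONV, and groups equal to the channel count gives the depthwise stage of DSC.

```python
import numpy as np

def grouped_accumulate(partials, groups):
    """Sum per-channel partial results within each channel group.

    partials: shape (M, 2, 2) -- one 2x2 output tile per input channel.
    groups=1 sums across all channels (Conventional Convolution);
    groups=M keeps every channel separate (depthwise stage of DSC).
    """
    M = partials.shape[0]
    assert M % groups == 0
    per_group = M // groups
    return partials.reshape(groups, per_group, 2, 2).sum(axis=1)

tiles = np.ones((8, 2, 2))
assert grouped_accumulate(tiles, 1).shape == (1, 2, 2)  # conventional: one sum
assert grouped_accumulate(tiles, 4).shape == (4, 2, 2)  # GCONV: 4 groups
assert grouped_accumulate(tiles, 8).shape == (8, 2, 2)  # depthwise: no summing
```

This is why a single accumulator structure can serve all three convolution types: only the group boundaries of the summation change, not the datapath itself.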

The calculation module is used to process the converted data of the input feature map and filter.

In the multiplication calculation array module, since the accelerator uses FDWA based on

The input data is calculated by the dot-product calculation module and output to a specific configurable accumulator for processing. By setting different multi-channel accumulator functions, the convolution acceleration calculation of lightweight CNNs can be realized; this also ensures the configurability of the accelerator.

As shown in

Since the on-chip storage space is insufficient to store the entire input feature map and filter, a specific storage structure is required to improve data reuse and smooth the data transfer between storage structures at all levels.
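A minimal software model of the double-buffer (ping-pong) scheme used here: while the PE array reads one bank, the DMA fills the other, hiding data-movement time behind computation. The class and method names are illustrative, not the accelerator's actual interface.

```python
class DoubleBuffer:
    """Ping-pong buffer: compute reads one bank while DMA fills the other."""

    def __init__(self):
        self.banks = [None, None]
        self.fill = 0                # bank currently being written by DMA

    def load(self, tile):
        """DMA side: write the next tile into the fill bank."""
        self.banks[self.fill] = tile

    def swap(self):
        """At a tile boundary, the two banks exchange roles."""
        self.fill ^= 1

    @property
    def compute(self):
        """Bank currently visible to the PE array."""
        return self.banks[self.fill ^ 1]

buf = DoubleBuffer()
buf.load("tile0")
buf.swap()                 # tile0 becomes visible to the PE array
buf.load("tile1")          # prefetch of tile1 overlaps compute on tile0
assert buf.compute == "tile0"
buf.swap()
assert buf.compute == "tile1"
```

As long as loading a tile takes no longer than computing on one, the PE array never stalls, which is the condition for running at its peak speed.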

In this section, an analysis model will be established to help with theoretical analysis of the accelerator performance.

Symbol | Description
---|---
H | The height of the input feature map
W | The width of the input feature map
M | The number of channels of the input feature map
K | The size of the filter
N | The number of channels of the filter
Freq | The clock frequency
q | The sparsity of the Sparse-Winograd matrix
r | The scale of the accelerator PE array

In theory, the total number of multiplication (

The data throughput of the entire accelerator in a single cycle is

Furthermore, the time required for the whole accelerator to run a single-layer convolution
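Since the paper's exact formulas are elided here, the following is a generic estimate of the multiplication count and dense speedup for a 3 × 3 layer under F(2 × 2, 3 × 3), using the symbols defined above. It assumes "same" padding (output size equals input size); it is a sketch, not the paper's analysis model.

```python
from math import ceil

def winograd_layer_mults(H, W, M, N, q=1.0):
    """Multiplications for one 3x3 layer under F(2x2, 3x3) with sparsity q.

    Each 2x2 output tile costs 16 multiplies (versus 36 direct), and the
    sparsity factor q scales the count by the fraction of operand pairs
    that survive ReLU/pruning in the Winograd domain.
    """
    tiles = ceil(H / 2) * ceil(W / 2)   # 2x2 output tiles per channel pair
    return int(tiles * 16 * M * N * q)

def direct_layer_mults(H, W, M, N, K=3):
    """Multiplications for the same layer computed directly."""
    return H * W * K * K * M * N

# VGG-16 Conv1.2 (M = N = 64, H = W = 224): the dense speedup is 36/16 = 2.25x,
# and sparsity (q < 1) raises it further.
speedup = (direct_layer_mults(224, 224, 64, 64)
           / winograd_layer_mults(224, 224, 64, 64))
```

Dividing the multiplication count by the number of multipliers (a function of the PE scale r) and by Freq then yields a per-layer runtime estimate of the kind the accelerator model uses.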

SystemC is a collaborative software/hardware design language as well as a system-level modeling language. It consists of a set of C++ classes and macros and provides an event-driven simulation kernel that enables the system designer to simulate concurrent processes in C++. In this paper, SystemC is used to accurately simulate each module and thread, facilitating cycle counts under a specific load. To fully assess the accelerator's acceleration effect under various convolution conditions, this paper uses workloads that are typically representative of both Conventional Convolution and the new convolution types, namely VGG-16 and MobileNet V1 respectively. The architecture parameters of each network are listed in

Layer | M | N | H(W) | K | Layer | M | N | H(W) | K
---|---|---|---|---|---|---|---|---|---
Conv1.1 | 3 | 64 | 224 | 3 | DSC1 | 32 | 64 | 112 | 3
Conv1.2 | 64 | 64 | 224 | 3 | DSC2 | 64 | 128 | 112 | 3
Conv2.1 | 64 | 128 | 112 | 3 | DSC3 | 128 | 128 | 56 | 3
Conv2.2 | 128 | 128 | 112 | 3 | DSC4 | 128 | 256 | 56 | 3
Conv3.1 | 128 | 256 | 56 | 3 | DSC5 | 256 | 256 | 28 | 3
Conv3.2 | 256 | 256 | 56 | 3 | DSC6 | 256 | 512 | 28 | 3
Conv3.3 | 256 | 256 | 56 | 3 | DSC7 | 512 | 512 | 14 | 3
Conv4.1 | 256 | 512 | 28 | 3 | DSC8 | 512 | 512 | 14 | 3
Conv4.2 | 512 | 512 | 28 | 3 | DSC9 | 512 | 512 | 14 | 3
Conv4.3 | 512 | 512 | 28 | 3 | DSC10 | 512 | 512 | 14 | 3
Conv5.1 | 512 | 512 | 14 | 3 | DSC11 | 512 | 512 | 14 | 3
Conv5.2 | 512 | 512 | 14 | 3 | DSC12 | 512 | 1024 | 14 | 3
Conv5.3 | 512 | 512 | 14 | 3 | DSC13 | 1024 | 1024 | 7 | 3

The first layer of the network, taking the original input, passes through neither the ReLU layer nor the Prune layer, so its parameter density is 100%. In the simulation, the accelerator operating frequency is set to 200 MHz, and the data is represented in 8-bit fixed-point format.

Moreover,

From the comparative analysis of

Design | Working frequency (MHz) | Throughput | Multiplier amount | Utilization
---|---|---|---|---
Shen et al. [ | 200 | 943 | 756 | 120%
Aydonat et al. [ | 303 | 1382 | 1476 | 93%
Zhang et al. [ | 150 | 137 | 780 | 18%
Zhang et al. [ | 200 | 266 | 1058 | 25%
This article | 200 | 945 | 576 | 142%

In this paper, a highly parallel design of a reconfigurable CNN accelerator based on Sparse-Winograd