Massive computational complexity and memory requirements of artificial intelligence models impede their deployability on edge computing devices in the Internet of Things (IoT). While Power-of-Two (PoT) quantization has been proposed to improve the efficiency of edge inference for Deep Neural Networks (DNNs), existing PoT schemes require a huge amount of bit-wise manipulation, carry large memory overhead, and their efficiency is bounded by the bottlenecks of computation latency and memory footprint. To tackle this challenge, we present an efficient inference approach based on PoT quantization and model compression. An integer-only scalar PoT quantization (IOS-PoT) is designed jointly with a distribution-loss regularizer, wherein the regularizer minimizes quantization errors and training disturbances. Additionally, two-stage model compression is developed to effectively reduce memory requirements and alleviate bandwidth usage in communications of networked heterogeneous learning systems. The product lookup table (PLUT) inference scheme replaces bit-shifting with only indexing and addition operations to achieve low-latency computation and enable efficient edge accelerators. Finally, comprehensive experiments on Residual Networks (ResNets) and efficient architectures with the Canadian Institute for Advanced Research (CIFAR), ImageNet, and Real-world Affective Faces Database (RAF-DB) datasets indicate that our approach achieves a 2×∼10× improvement in the reduction of both weight size and computation cost in comparison to state-of-the-art methods. A PLUT accelerator prototype is implemented on the Xilinx KV260 Field Programmable Gate Array (FPGA) platform to accelerate convolution operations, with performance results showing that PLUT reduces memory footprint by 1.45× and achieves more than 3× power efficiency and 2× resource efficiency compared to the conventional bit-shifting scheme.
The new era of the IoT enables a smart society by interconnecting cyberspace with the physical world. At the same time, artificial intelligence (AI) has spread widely across a variety of business sectors and industries. A number of revolutionary applications in computer vision, games, speech recognition, medical diagnostics, and many other areas are reshaping our everyday lives. Traditionally, IoT devices would send data to a centralized cloud server for processing and analysis. However, this approach can lead to delays due to data transmission and processing times. To address this issue, edge computing in IoT devices is proposed, referring to the practice of processing data closer to its source, at the edge of the network, near the devices that generate the data. This approach aims to minimize latency, reduce bandwidth usage, enhance privacy and security, and improve overall system efficiency. By deploying computing resources closer to where data is generated, it enables faster decision-making, which is critical for applications that require real-time responses (e.g., industrial automation, autonomous vehicles, and healthcare monitoring).
Despite the myriad advantages inherent in edge computing, it grapples with multifaceted challenges. The foremost challenge lies in resource constraints. Edge devices often have limited computational power, memory, and storage capacity compared to centralized servers. This restricts the complexity and scale of computations that can be performed at the edge. Therefore, edge computing in IoT devices calls not only for low-power chips for energy-efficient processing at the edge but also for models with low latency and minimal accuracy loss. To address this issue, a variety of low-precision quantization methods have emerged to reduce model size and arithmetic complexity.
Quantization attempts to reduce data bit-width in IoT devices' local computing, shrink model size for memory saving, and simplify operations for compute acceleration. Low-precision quantization has attracted plenty of attention, wherein inference acceleration substitutes fewer-bit multiplication for intricate 32-bit floating-point (FP32) multiplication. An integer-arithmetic inference framework is introduced to quantize full-precision weights and activations into 8-bit integer values, while bias parameters remain 32-bit to maintain baseline accuracy [
On the one hand, extreme low-precision quantization is an intriguing area of research that converts full-precision values into representations using a minimal number of bits, thus requiring the least computation. Learned Step Size Quantization Plus (LSQ+) improves low-bit quantization by learning optimal offsets and using better parameter initialization [
On the other hand, the PoT paradigm is widely explored and serves as a promising approach to bridging the gap between computing acceleration and accuracy degradation. PoT introduces bit-shifting operations into computation in order to achieve higher numerical resolution. In earlier works, group-wise PoT quantization with iterative retraining is adopted to quantize a pretrained FP32 model into its low-precision representation, where weights are quantized uniformly [
However, most research based on the PoT scheme requires substantial bit-wise operations and relies heavily on either computation/memory-expensive FP32 hyperparameters or fixed-point quantization of activations to achieve acceptable accuracy performance [
To achieve low-latency inference and resource/memory-efficient computation at the edge of the IoT, we propose an end-to-end efficient inference approach that encapsulates integer-only scalar PoT (IOS-PoT) quantization, two-stage model compression, and PLUT inference. To address the practical needs of IoT scenarios, a hardware accelerator implementing our approach is developed using Verilog Hardware Description Language (HDL) and deployed on Field Programmable Gate Array (FPGA) platforms with a camera. First, the 32-bit values of the original model undergo IOS-PoT quantization. Then, we use Verilog HDL to develop register-transfer-level (RTL) hardware code for the accelerator and deploy the inference method on the Xilinx KV260 FPGA platform. This process is illustrated in
We present an end-to-end quantization and efficient inference approach, realized by hardware-friendly IOS-PoT quantization, model compression, and PLUT, dedicated to effectively deploying deep neural models on edge learning devices with limited resources and power budgets, and to implementing efficient acceleration infrastructure.
To shrink the model size, IOS-PoT quantization and two-stage model compression are proposed to map a full-precision model to a low-bit-width representation, where a jointly designed distribution-loss regularizer is introduced to minimize the mismatch between quantization inputs and outputs, and signed Huffman (S-Huff) encoding is introduced to improve memory efficiency.
To reduce the bottleneck of computing latency and hardware-resource overhead, PLUT inference is proposed to substitute value indexing and addition for matrix multiplication, in which the PLUT is shared among low-precision weights and activations at inference. Furthermore, an inference accelerator prototype based on the PLUT scheme for convolution operations on FPGA is developed.
Due to their complex architecture, Deep Neural Networks (DNNs) suffer from huge memory consumption and computation demands as well as considerable inference latency. As embedded computing systems rapidly proliferate into every aspect of human endeavor today, these drawbacks pose a huge challenge to deploying deeper models on edge platforms. To address this issue, a large body of work has emerged that allows for tolerable performance degradation. Existing works include but are not limited to network pruning, low-precision quantization, and weight compression.
Weight compression refers to reducing the memory requirements of model parameters, in which efficient encoding schemes are usually employed to decrease memory access and storage size. A network compression pipeline is introduced that combines pruning, quantization, weight sharing, and Huffman coding to achieve a remarkable compression ratio. Pruning attains the highest compression ratio of around 10×, while quantization and encoding provide an additional 2×∼4× [
Assume that a pretrained full-precision (e.g., 32-bit floating-point) convolutional neural network (CNN) is represented by
In order to boost hardware efficiency and lower latency at the inference phase for DNNs, uniform PoT quantization is introduced by restricting all weights to be PoT values or zeros as shown:
As PoT quantization is non-uniform, it has good resolution for weight approximation owing to its exponential spacing of levels. Quantization levels are represented by PoT values, which means that multiplication operations between a PoT number
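For illustration, a minimal sketch of projecting weights onto power-of-two levels; the exponent range and the zero threshold here are assumptions for the example, not the paper's exact scheme:

```python
import numpy as np

def pot_quantize(w, n_levels=7):
    """Project each weight onto the nearest power-of-two level or zero.

    Rounding is done in the log2 domain for simplicity; levels are
    {0, +/-2^0, +/-2^-1, ..., +/-2^-(n_levels-1)} (an illustrative choice).
    """
    sign = np.sign(w)
    mag = np.abs(w)
    exp = np.clip(np.round(np.log2(np.maximum(mag, 1e-12))), -(n_levels - 1), 0)
    q = sign * 2.0 ** exp
    q[mag < 2.0 ** (-(n_levels - 1)) / np.sqrt(2)] = 0.0  # too small -> zero level
    return q

# multiplying by any of these levels is a pure bit-shift in hardware
print(pot_quantize(np.array([0.9, 0.3, -0.06, 0.001])))
```

Because every non-zero level is a power of two, a multiply against a quantized weight reduces to shifting the other operand by the level's exponent.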
Quantization, which maps high-precision numbers to low-precision representations, inherently leads to deviations. In quantized neural networks (QNNs), the objective is to maintain acceptable prediction accuracy, on which the degree of quantization error has a direct impact. A variety of methods for optimizing quantization loss to achieve good final accuracy have emerged [
One approach in literature to achieve PoT quantization is via learning lookup tables (weight assignment tensors and dictionaries) [
To summarize, our approach has three distinctions from previous methods. (1) With IOS-PoT, tensor scaling at runtime is achieved simply by a bit-shifting operation, omitting the division operations of prior PoT methods. (2) We utilize a two-stage compression pipeline to reduce memory requirements, where S-Huff encoding and decoding hardware reduce the memory footprint at inference. (3) In our product lookup table (PLUT) scheme, no multiplications are required at runtime; indexing and addition are thus the two main operations for computing matrix products, breaking the bottleneck of computation latency and hardware-resource overhead.
The proposed method consists of IOS-PoT quantization, a distribution-loss regularizer, model compression, and the PLUT inference scheme (illustrated in
IOS-PoT quantization is introduced to remove the computation overhead arising from FP32 scaling parameters; the required arithmetic operations are merely bit-shifts and additions. Quantization is performed on weights and activations to project full-precision values onto low-precision representations during the forward pass, after which the prediction accuracy is calculated, while FP32 weights and gradients are kept for gradient descent during the backward pass.
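A toy sketch of this retraining loop, under assumed details: a one-weight-layer model, a squared-error loss, and the common straight-through estimator for the non-differentiable projection:

```python
import numpy as np

def pot_project(w):
    """Forward-pass projection onto power-of-two levels (illustrative range)."""
    exp = np.clip(np.round(np.log2(np.maximum(np.abs(w), 1e-12))), -6, 0)
    return np.sign(w) * 2.0 ** exp

rng = np.random.default_rng(0)
w_fp32 = rng.normal(size=4)        # FP32 shadow weights kept for SGD
x = rng.normal(size=4)             # a toy input; target output is 1.0
for _ in range(100):
    q = pot_project(w_fp32)        # forward pass sees only low-precision weights
    y = q @ x
    grad_q = 2.0 * (y - 1.0) * x   # dL/dq for L = (y - 1)^2
    w_fp32 -= 0.05 * grad_q        # straight-through: treat dq/dw as identity
print(pot_project(w_fp32) @ x)
```

The full-precision copy accumulates small gradient updates that eventually flip quantized levels, which is why it must be retained during retraining and only discarded at deployment.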
To avoid matrix multiplication involving full-precision numbers at inference, the quantized model is subject to full PoT representation in IOS-PoT. For low-bit projection, the clipping threshold is defined as
In IOS-PoT quantization, each quantization level is defined as follows:
Considering the quantization flow during retraining, the clipping function is first applied to the weights with a PoT scaling factor
Stochastic gradient descent (SGD) is applied to jointly optimize the scaling factor
A distribution-loss regularizer is developed to alleviate the mismatch between full-precision and low-precision weights. Gradients of weights during backpropagation do not always contribute to diminishing the gap between quantized outputs and inputs, leading to severe accuracy degradation and model divergence. By employing the distribution loss, the mismatch problem is steadily reduced.
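As a hedged sketch of the general idea, assuming a simple squared-gap form between FP32 weights and their PoT projection (the paper's exact formulation may differ):

```python
import numpy as np

def pot_project(w):
    exp = np.clip(np.round(np.log2(np.maximum(np.abs(w), 1e-12))), -6, 0)
    return np.sign(w) * 2.0 ** exp

def distribution_loss(w, lam=0.1):
    """Penalize the gap between FP32 weights and their quantized projection,
    pulling the weight distribution toward quantization levels during retraining."""
    return lam * np.mean((w - pot_project(w)) ** 2)

w = np.array([0.25, 0.3, -0.8])
total_loss = 0.0                    # the task loss would be added here
total_loss += distribution_loss(w)  # regularizer added to the training objective
print(total_loss)
```

Weights that already sit on quantization levels contribute zero penalty, so the term only steers weights stranded between levels.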
Consider a general nonlinear DNN consisting of
In addition, a learnable hyperparameter
Model compression and inference schemes based on PLUT are designed jointly with the IOS-PoT quantization scheme to facilitate reducing computation and memory requirements, and to enable efficient implementations of specialized accelerators. Ahead of deployment, codebooks of low-precision weights and activations are generated and compressed by S-Huff with the sign-encoding technique. At runtime, codebooks are unpacked and loaded onto memory banks, while activations are quantized to low precision by dedicated hardware logic in real time.
Weight sharing and S-Huff encoding constitute the first stage of weight reduction, while the last stage (named post-compression) is implemented by a standard compression algorithm. In IOS-PoT quantization, weights and activations are projected onto low-bit-width quantization levels with different optimized integer scaling parameters. We observe that a great number of zeros exist in the weights. To remove inconsequential memory storage, the sign bit is separated from the weight values, and for zero-valued weights only the sign bits are kept. Hence, weight sharing and Huffman encoding are applied exclusively to non-zero weights to generate a lightweight model. Then, PoT quantization levels are efficiently encoded to produce the PLUT for inference. Post-compression based on a traditional compression algorithm is adopted to reduce the size of the generated codebooks, e.g., the Lempel-Ziv-Markov chain algorithm (LZMA) [
To sum up, the compression flow is realized by a two-stage pipeline: (1) weight sharing and S-Huff encoding; (2) a specialized compression algorithm. The details are illustrated in
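The two stages above can be sketched as follows, with assumed details: toy quantized values, a simple textbook Huffman builder over non-zero magnitudes, and LZMA standing in for the post-compression algorithm:

```python
import heapq, lzma
from collections import Counter
from itertools import count

def huffman_code(symbols):
    """Build a prefix-free Huffman code {symbol: bitstring} from frequencies."""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tick = count()  # tie-breaker so the heap never compares dicts
    heap = [(f, next(tick), {s: ""}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tick), merged))
    return heap[0][2]

# Stage 1: split sign bits from values; zero weights keep only their flag
# bit, and Huffman coding covers only the non-zero magnitudes.
weights = [0, 0, 1, 0, -2, 4, 0, 0, -1, 0, 2, 0]   # toy quantized values
signs = "".join("1" if w < 0 else "0" for w in weights)
nonzero = [abs(w) for w in weights if w != 0]
code = huffman_code(nonzero)
bitstream = "".join(code[m] for m in nonzero)

# Stage 2: a general-purpose compressor (LZMA here) over the packed codebook.
packed = lzma.compress((signs + bitstream).encode())
print(len(code), len(bitstream))
```

Since most weights are zero, storing one flag bit for each of them plus a short Huffman stream for the rest is far cheaper than storing a code for every position.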
Matrix multiplication is fundamental to DNN computation, and the complexity of multiplication determines inference latency and energy efficiency. For the classic PoT scheme, matrix multiplication requires a substantial amount of bit-shifting and addition operations, e.g., LUTNet, which adopts lookup tables to map a full-precision model to PoT representation directly [
The matrix product obtained via PLUT is numerically equivalent to the result of standard multiply-add accumulations (MACs), yet more efficient, at only the small cost of extra memory space. The matrix multiplication performed upon PLUT is exemplified in
To assess the memory overhead and computation efficiency of PLUT, an analysis is presented.
Given 4-bit quantized weights and activations (1 bit for sign and 3 bits for value), the dimension of the product matrix of PLUT is
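A sketch of the lookup-based product under this 4-bit setting (the level set and the sign handling here are illustrative assumptions): with 3-bit magnitudes, every possible weight-activation magnitude product fits in one small table built once, so runtime needs only indexing and accumulation.

```python
import numpy as np

# 3-bit magnitude codes: a zero level plus seven power-of-two levels (assumed set)
levels = np.array([0.0] + [2.0 ** -e for e in range(7)])
plut = np.outer(levels, levels)   # 8 x 8 product lookup table, built once

def plut_dot(w_idx, w_sign, a_idx, a_sign):
    """Dot product of PoT-coded vectors: indexing replaces multiplication;
    sign bits decide whether each looked-up product is added or subtracted."""
    prods = plut[w_idx, a_idx]
    return float(np.sum(prods * (w_sign * a_sign)))

w_idx, w_sign = np.array([1, 3, 0, 2]), np.array([1, -1, 1, 1])
a_idx, a_sign = np.array([2, 2, 5, 1]), np.array([1, 1, -1, 1])
ref = float(np.sum(levels[w_idx] * w_sign * levels[a_idx] * a_sign))
assert plut_dot(w_idx, w_sign, a_idx, a_sign) == ref
print(plut.shape)   # the product matrix for 4-bit (sign + 3-bit) codes
```

The table is shared across all layers that use the same code set, which is why its memory cost stays constant while every multiply in the network is eliminated.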
Comprehensive experiments are conducted to justify the efficacy of the proposed approach in comparison with prior methods. Furthermore, the datasets we select are closely aligned with IoT scenarios. First, image classification tasks on the CIFAR-10/100 and ImageNet datasets are selected for evaluation [
32-bit floating-point models are trained from scratch as baselines for performance evaluation. Accuracy is evaluated with the Top-1 and Top-5 metrics, following conventional assessment standards [
To assess computational complexity, a bit-operation scheme is employed to calculate the amount of computation under different low bit-widths, as introduced in [
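As a hedged illustration of such bit-operation accounting (the cited scheme's exact normalization is not reproduced here; we assume each multiply-accumulate is charged the product of its operand bit-widths):

```python
def bitops(macs, w_bits, a_bits):
    """Bit-operations for a layer: each MAC is charged w_bits * a_bits."""
    return macs * w_bits * a_bits

macs = 1_000_000              # hypothetical layer with one million MACs
full = bitops(macs, 32, 32)   # FP32 baseline
quant = bitops(macs, 4, 4)    # 4-bit weights and activations
print(full // quant)          # 64x fewer bit-operations under this simple model
```

Under this simple model a 4/4-bit layer costs 64× fewer bit-operations than FP32; the FIXOPS ratios reported in the tables are smaller, presumably because the full accounting also includes costs that do not shrink with bit-width.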
The CIFAR-10 dataset consists of 60,000 RGB (Red, Green, Blue) images, and the size of each image is
CIFAR-100 is similar to CIFAR-10 but has many more image categories, making classification more challenging. There are 100 classes grouped into 20 superclasses; each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it is closely related). Each class comprises 600 images, with 500 images used for training and 100 images for testing.
ImageNet (ILSVRC-12) is a much more difficult task to deal with, containing approximately 1.2 million training and 50,000 validation images.
RAF-DB contains 30,000 facial images annotated with basic or compound expressions by 40 trained human coders. In the experiment, only images with the six basic expressions (happiness, surprise, sadness, anger, disgust, fear) and neutral are used, which leads to 12,271 images for training and 3,068 images for testing. All images are resized to
DNN models are simulated under the PyTorch framework, and all full-precision models are trained from scratch as initialization. Low-bit-width quantization starts from the pretrained models and then iteratively retrains the low-bit weights to recover the accuracy loss. Weight normalization is adopted before quantization to help stabilize the weight distribution [
In this section, the proposed approach is quantitatively analyzed from both the software and inference-hardware aspects. First, the efficacy and influence of IOS-PoT quantization and model compression are validated on ResNet-18 with the CIFAR-100 dataset. Second, the inference efficiency and performance of different hardware implementations built on the FPGA platform are thoroughly investigated.
The right part of
In contrast, the quantization levels from integer scalar exhibit regularities of simple PoT terms, e.g.,
The impact of the distribution-loss regularizer is carefully investigated on ResNet-18. Three settings are compared in the experiment: (a) no regularization applied, (b) a fixed parameter, and (c) the proposed approach with a learning strategy. The retraining settings are as follows: a learning rate of 0.04 with a decay factor of 0.1 at epochs 60, 120, 160, 200, and 260; 300 training epochs in total; and a batch size of 128. The coefficient
The results are plotted as convergence curves (retraining, testing Top-1 and Top-5 accuracy) in
The graphs show that the stable convergence achieved by learnable regularization reduces retraining overhead without losing prediction accuracy. The training overhead and accuracy results are presented in
| Method | Precision (W/A) | Top-1 | Top-5 | Epochs |
| --- | --- | --- | --- | --- |
| FP32 | 32/32 | 74.08% | 92.26% | – |
| IOS-PoT (original) | 4/4 | 74.20% | 92.05% | >180 |
| IOS-PoT (fixed coefficient) | 4/4 | 74.21% | 92.00% | >200 |
| IOS-PoT (learnable coefficient) | 4/4 |  |  |  |
To demonstrate the efficacy of the proposed model compression, a comprehensive investigation of Huffman encoding is presented on ResNet-18 after retraining converges, where FP32, non-Huffman (which stores low-bit-width weights without additional compression), and the S-Huff approach are compared under the APoT and proposed IOS-PoT schemes. Furthermore, the performance of the two-stage model compression is analyzed.
| Layer | Weight shape | Weights | Zero weights | FP32 (KB) |
| --- | --- | --- | --- | --- |
| Conv3 |  | 524,288 | 63,293 | 2,048 |
| Conv4 |  | 2,097,152 | 230,420 | 8,192 |
| Conv5 |  | 8,388,608 | 1,616,156 | 32,768 |
We find that a large fraction of weights tend to be zero when the quantized model has converged; thus, employing an efficient encoding method to reduce the memory consumption of these weights is of significant importance. This fact is evidenced in
The utilization of hardware resources for PLUT and bit-shifting (SHIFT) arithmetic is simulated on the Xilinx KV260 FPGA, whose available resources are presented in
| Item | Specification |
| --- | --- |
| Part # | XCK26-SFVC784-2LV-C |
| Node | 28 nm |
| LUTs | 117,120 |
| D flip-flops (DFFs) | 234,240 |
| Block random access memories (BRAMs) | 5,184 KB |
| Ultra random access memories (URAMs) | 18,432 KB |
| Digital signal processors (DSPs) | 1,248 |
| Double data rate fourth generation synchronous dynamic random access memory (DDR4) | 19 GB/s |
| CPU | Quad-core Arm Cortex-A53, 64-bit |
We first compare the proposed method with prior PoT methods on the CIFAR datasets, where residual deep neural models are used for performance evaluation. Then, experiments on efficient model architectures (e.g., SqueezeNet and ShuffleNetV2) are analyzed and compared with state-of-the-art (SOTA) approaches to further examine the quantization method.
The performance of ResNet-20 (RES20) and ResNet-56 (RES56) is investigated on CIFAR-10, while ResNet-18 is evaluated on the CIFAR-100 dataset [
The results on CIFAR10/100 are given in
The proposed approach achieves the best compression rate among all selected PoT methods with a negligible loss in prediction accuracy compared to the full-precision baseline model. On ResNet-20/56 with 3-bit precision, our method achieves
Most of the Top-1 and Top-5 results of the proposed approach surpass the others; in the few exceptions, our method incurs an accuracy loss of less than 0.2%, e.g., the Top-1 accuracy of APoT outperforms the proposed approach by 0.06% on ResNet-18 under 5-bit precision. Among the tested methods, APoT ranks as the second-best algorithm in accuracy, while PACT and DeepShift suffer substantial accuracy drops under low precision such as 3-bit, with more than a 3% decline in Top-1 accuracy. Overall, the proposed quantization scheme maintains accuracy with almost no loss while achieving a significant reduction in both memory requirements and computation overhead. Our method remains accuracy-competitive even under extremely low precision (e.g., 3/4-bit), gaining a large improvement in acceleration efficiency.
| Method | Precision (W/A) | Top-1 | Top-5 | Weight size | Δ Acc-1 | Δ Acc-5 | FIXOPS | Weight comp. | FIXOPS comp. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FP32 (RES20) | 32/32 | 92.36% | 99.78% | 1.2 MB | – | – | 744 M | – | – |
| APoT | 4/4 | 92.52% |  | 148 KB | 0.16% | 0.03% | 194 M | 8× | 3.8× |
| LQ-Net | 4/4 | 90.81% | 99.63% | 148 KB | −1.55% | −0.15% | 200 M | 8× | 3.7× |
| DeepShift | 4/8 | 89.93% | 99.69% | 148 KB | −2.43% | −0.09% | 266 M | 8× | 2.8× |
| PACT | 4/4 | 89.88% | 99.62% | 148 KB | −2.48% | −0.16% | 200 M | 8× | 3.7× |
| Ours | 4/4 |  | 99.79% |  |  |  |  |  |  |
| APoT | 3/3 | 92.00% | 99.74% | 111 KB | −0.36% | −0.04% | 174 M | 11× | 4.3× |
| LQ-Net | 3/3 | 90.61% | 99.74% | 111 KB | −1.75% | −0.04% | 180 M | 11× | 4.1× |
| DeepShift | 3/8 | 86.20% | 99.47% | 111 KB | −6.16% | −0.31% | 240 M | 11× | 3.1× |
| PACT | 3/3 | 89.46% | 99.73% | 111 KB | −2.90% | −0.05% | 180 M | 11× | 4.1× |
| Ours | 3/3 |  |  |  |  |  |  |  |  |
| FP32 (RES56) | 32/32 | 93.64% | 99.73% | 3.4 MB | – | – | 2287 M | – | – |
| APoT | 4/4 | 93.73% | 99.81% | 537 KB | 0.09% | 0.08% | 572 M | 8× | 4× |
| LQ-Net | 4/4 | 91.73% | 99.69% | 537 KB | −1.91% | −0.04% | 578 M | 8× | 3.9× |
| DeepShift | 4/8 | 91.97% | 99.70% | 537 KB | −1.67% | −0.03% | 768 M | 8× | 3× |
| PACT | 4/4 | 90.11% | 99.59% | 537 KB | −3.75% | −0.14% | 578 M | 8× | 3.9× |
| Ours | 4/4 |  |  |  |  |  |  |  |  |
| APoT | 3/3 | 92.78% | 99.67% | 432 KB | −0.86% | −0.06% | 510 M | 10× | 4.5× |
| LQ-Net | 3/3 | 91.65% | 99.72% | 432 KB | −1.99% | −0.01% | 516 M | 10× | 4.4× |
| DeepShift | 3/8 | 87.24% | 99.35% | 432 KB | −6.40% | −0.38% | 686 M | 10× | 3.3× |
| PACT | 3/3 | 89.19% | 99.67% | 432 KB | −4.45% | −0.06% | 516 M | 10× | 4.4× |
| Ours | 3/3 |  |  |  |  |  |  |  |  |
| Method | Precision (W/A) | Top-1 | Top-5 | Weight size | Δ Acc-1 | Δ Acc-5 | FIXOPS | Weight comp. | FIXOPS comp. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FP32 | 32/32 | 74.08% | 92.26% | 43.9 MB | – | – | 677 M | – | – |
| APoT | 6/6 | 74.29% | 92.00% | 9963 KB | 0.21% | −0.26% | 210 M | 4.5× | 3.2× |
| LQ-Net | 6/6 | 72.01% | 91.17% | 9963 KB | −2.07% | −1.09% | 245 M | 4.5× | 2.7× |
| DeepShift | 6/8 | 73.83% | 91.98% | 9963 KB | −0.25% | −0.28% | 325 M | 4.5× | 2.1× |
| PACT | 6/6 | 70.13% | 90.32% | 9963 KB | −3.95% | −1.94% | 245 M | 4.5× | 2.7× |
| Ours | 6/6 |  |  |  |  |  |  |  |  |
| APoT | 5/5 |  |  | 8586 KB |  |  | 193 M | 5.2× | 3.5× |
| LQ-Net | 5/5 | 72.05% | 91.15% | 8586 KB | −2.03% | −1.11% | 227 M | 5.2× | 2.9× |
| DeepShift | 5/8 | 73.53% | 91.77% | 8586 KB | −0.55% | −0.49% | 300 M | 5.2× | 2.3× |
| PACT | 5/5 | 70.79% | 90.49% | 8586 KB | −3.29% | −1.77% | 227 M | 5.2× | 2.9× |
| Ours | 5/5 | 74.17% | 92.26% |  | 0.09% | 0.00% |  |  |  |
| APoT | 4/4 | 74.39% | 92.20% | 7210 KB | 0.31% | −0.06% | 175 M | 6.2× | 3.9× |
| LQ-Net | 4/4 | 71.66% | 91.25% | 7210 KB | −2.42% | −1.01% | 210 M | 6.2× | 3.2× |
| DeepShift | 4/8 | 74.07% |  | 7210 KB | −0.01% |  | 270 M | 6.2× | 2.5× |
| PACT | 4/4 | 70.05% | 90.32% | 7210 KB | −4.03% | −1.94% | 210 M | 6.2× | 3.2× |
| Ours | 4/4 |  | 92.21% |  |  | −0.05% |  |  |  |
| APoT | 3/3 |  | 91.89% | 5834 KB |  | −0.37% | 158 M | 7.7× | 4.3× |
| LQ-Net | 3/3 | 71.55% | 90.97% | 5834 KB | −2.53% | −1.29% | 193 M | 7.7× | 3.5× |
| DeepShift | 3/8 | 59.30% | 85.85% | 5834 KB | −14.78% | −6.41% | 250 M | 7.7× | 2.7× |
| PACT | 3/3 | 70.00% | 90.25% | 5834 KB | −4.08% | −2.01% | 193 M | 7.7× | 3.5× |
| Ours | 3/3 | 74.38% |  |  |  | −0.30% |  |  |  |
Quantization of efficient neural architectures is often a challenging task, and experiments on such architectures are vitally important for low-precision quantization research. To verify performance, experiments are performed on CIFAR-100, where the SqueezeNet and ShuffleNetV2 DNN models are selected and quantized under 3/4-bit precision [
The performance evaluation is presented in
| Method | Precision (W/A) | Top-1 | Weight size | Δ Acc-1 | FIXOPS | Weight comp. | FIXOPS comp. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FP32 (ShuffleNetV2) | 32/32 | 70.10% | 5.5 MB | – | 820 M | – | – |
| BM-float | 4/4 | 64.95% | 1221 KB | −5.15% | 194 M | 4.6× | 4.2× |
| PROFIT | 4/4 | 64.90% | 1221 KB | −5.20% | 212 M | 4.6× | 3.8× |
| Ours | 4/4 |  |  |  |  |  |  |
| BM-float | 3/3 | 46.00% | 1067 KB | −24.10% | 176 M | 5.2× | 4.6× |
| PROFIT | 3/3 | 63.08% | 1067 KB | −7.02% | 190 M | 5.2× | 4.3× |
| Ours | 3/3 |  |  |  |  |  |  |
| FP32 (SqueezeNet) | 32/32 | 68.72% | 3.1 MB | – | 987 M | – | – |
| BM-float | 4/4 | 65.50% | 653 KB | −3.22% | 266 M | 4.8× | 3.7× |
| PROFIT | 4/4 |  | 653 KB |  | 287 M | 4.8× | 3.4× |
| Ours | 4/4 | 68.79% |  | 0.07% |  |  |  |
| BM-float | 3/3 | 41.00% | 563 KB | −27.72% | 246 M | 5.6× | 4.0× |
| PROFIT | 3/3 |  | 563 KB |  | 262 M | 5.6× | 3.7× |
| Ours | 3/3 | 66.92% |  | −1.80% |  |  |  |
For computation efficiency, our approach on average improves by two orders of magnitude with respect to the others (
Experiments on the representative ImageNet dataset are also conducted. In this experiment, the proposed approach is evaluated on the commonly used ResNet-18 neural network, with comparisons to newly published SOTA methods in the literature, e.g., block-mini float (BM-float) [
| Method | Precision (W/A) | Top-1 | Weight size | Δ Acc-1 | FIXOPS | Weight comp. | FIXOPS comp. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FP32 | 32/32 | 69.75% | 43.9 MB | – | 32.76 G | – | – |
| APoT | 4/4 | 70.10% | 9963 KB | 0.35% | 5.8 G | 4.5× | 5.6× |
| BM-float | 4/4 | 69.00% | 8115 KB | −0.75% | 5.1 G | 5.5× | 6.4× |
| Ours | 4/4 |  |  |  |  |  |  |
| APoT | 3/3 | 69.11% | 8586 KB | −0.64% | 4.7 G | 5.2× | 7× |
| DeepShift | 5/8 |  | 8586 KB |  | 7.5 G | 5.2× | 4.4× |
| BM-float | 3/3 | 66.80% | 6762 KB | −2.95% | 4.1 G | 6.6× | 8× |
| Ours | 3/3 | 69.33% |  | −0.42% |  |  |  |
| APoT | 4/4 | 69.00% | 7210 KB | −0.75% | 4 G | 6.2× | 8.2× |
| DeepShift | 4/8 | 69.56% | 7210 KB | −0.19% | 6.4 G | 6.2× | 5.1× |
| LSQ+ | 4/4 |  | 5674 KB |  | 3.7 G | 7.9× | 8.9× |
| Ours | 4/4 | 69.35% |  | −0.40% |  |  |  |
| APoT | 3/3 | 68.55% | 5834 KB | −1.20% | 3.1 G | 7.7× | 10.6× |
| LSQ+ | 3/3 |  | 4323 KB |  | 2.7 G | 10.4× | 12.1× |
| Ours | 3/3 | 68.95% |  | −0.80% |  |  |  |
Furthermore, a facial expression recognition task is used to validate the robustness of the quantization methods, with Self-Cure Network [
| Method | Precision (W/A) | Top-1 | Δ Acc-1 | FIXOPS | FIXOPS comp. |
| --- | --- | --- | --- | --- | --- |
| FP32 | 32/32 | 76.69% | – | 32.76 G | – |
| APoT | 4/4 | 76.86% | 0.17% | 4 G | 8× |
| Ours | 4/4 |  |  |  |  |
| APoT | 3/3 | 75.85% | −0.84% | 3.1 G | 11× |
| Ours | 3/3 |  |  |  |  |
In this section, we target the implementation of an efficient CNN accelerator. The hardware accelerator for the proposed PLUT inference scheme is developed on the Xilinx KV260 FPGA platform using Verilog HDL. The efficiency of our implementation is then evaluated and compared against a conventional bit-shifting scheme (adopted by prior PoT methods), as well as general-purpose computing hardware (Intel Core(TM) i3-2120 @ 3.30 GHz and NVIDIA RTX 3060 12 GB).
The overall architecture of the PLUT accelerator is shown in
The memory footprint, S-Huff decoding overhead, and computation efficiency of the proposed accelerator are benchmarked on ResNet-18 with the ImageNet dataset, where the convolution operations are performed on the CNN accelerator and the ARM CPU manages transfer and control scheduling. Overall, PLUT achieves
|  | LUT | LUTRAM | DFF | BRAM | DSP |
| --- | --- | --- | --- | --- | --- |
| Avail. | 117,120 | 57,600 | 234,240 | 144 | 1,248 |
| PoT-SHIFT | 40,546 | 1,740 | 54,030 | 66 | 0 |
| PLUT | 19,267 | 1,841 | 23,753 | 87 | 0 |
| Layer | PLUT accesses | PoT-SHIFT accesses | Efficiency | S-Huff decoder latency |
| --- | --- | --- | --- | --- |
| Conv3 | 46,080 | 65,536 | 1.42× | 281 |
| Conv4 | 181,760 | 262,144 | 1.44× | 1,139 |
| Conv5 | 721,920 | 1,048,576 | 1.45× | 4,134 |
|  | PLUT | PoT-SHIFT | CPU | GPU |
| --- | --- | --- | --- | --- |
| Conv3 latency (ms) | 25.7 | 102.3 | 406 | 2.7 |
| Conv4 latency (ms) | 20.4 | 80 | 380 | 2.5 |
| Conv5 latency (ms) | 20.4 | 81.4 | 380 | 2.1 |
| Power (W) | 4 | 4.2 | 65 | 170 |
| GOPS/W | 90 | 28 | 0.4 | 44 |
In this paper, we expounded the advantages and limits of edge computing on IoT devices, investigated PoT quantization based on bit-shifting logic, and found that the computation and memory efficiency of prior schemes is not optimal, making it difficult to apply AI models to resource-limited edge computing devices with intensive communications in IoT scenarios. To tackle this challenge, we proposed a PLUT inference approach with IOS-PoT quantization and compression techniques. We found that the mismatch between quantization inputs and outputs can be mitigated by employing a tailored distribution-loss regularizer, which assists quantization convergence. Efficient inference is achieved by PLUT with weight sharing and S-Huff encoding, which reduces memory footprint and eliminates multiplication operations for acceleration. Comprehensive experiments were conducted on ResNets and efficient architectures (i.e., ShuffleNetV2 and SqueezeNet) with the CIFAR-10/100, ImageNet, and RAF-DB datasets to validate the efficacy of the proposed approach. Our approach outperformed existing PoT and SOTA methods by several orders of magnitude in weight and computation reduction, achieving
The authors would like to acknowledge the financial support of State Key Laboratory of Intelligent Vehicle Safety Technology, Chongqing Municipal Education Commission, and the technical support of Foresight Technology Institute, Chongqing Changan Automobile Co., Ltd., and School of Computer Science and Engineering, Chongqing University of Science and Technology.
This work was supported by the Open Fund Project of the State Key Laboratory of Intelligent Vehicle Safety Technology under Grant No. IVSTSKL202311, the Key Projects of the Science and Technology Research Programme of the Chongqing Municipal Education Commission under Grant No. KJZDK202301505, the 2021 Cooperation Project between Chongqing Municipal Undergraduate Universities and Institutes Affiliated to the Chinese Academy of Sciences under Grant No. HZ2021015, and the Chongqing Graduate Student Research Innovation Program under Grant No. CYS240801.
The authors confirm contribution to the paper as follows: study conception and design: Fangzhou He, Dingjiang Yan, Jie Li; data collection: Dingjiang Yan, Ke Ding, Jiajun Wang, Mingzhe Chen; analysis and interpretation of results: Fangzhou He, Dingjiang Yan, Jie Li; draft manuscript preparation: Fangzhou He. All authors reviewed the results and approved the final version of the manuscript.
The datasets that support the findings of this study are openly available and are cited in the references.
Not applicable.
The authors declare that they have no conflicts of interest to report regarding the present study.