Image semantic segmentation has become an essential part of autonomous driving. To further improve the generalization ability and robustness of semantic segmentation algorithms, a lightweight network based on the Squeeze-and-Excitation (SE) attention mechanism and Depthwise Separable Convolution (DSC) is designed. In addition, AdamGC, an Adam optimization algorithm based on Gradient Compression (GC), is proposed to improve the training speed, segmentation accuracy, generalization ability and stability of the network. To verify its effectiveness, the trained model is evaluated and compared with other methods on the Cityscapes semantic segmentation dataset. The results show that the proposed network achieves 78.02% MIoU on the Cityscapes validation set, which is better than the basic network and the other recent semantic segmentation networks compared. Besides meeting the stability and accuracy requirements, the work is of significance for the development of image semantic segmentation.
With the combination of Artificial Intelligence (AI) and automobile transportation, autonomous driving [
In recent years, due to the progress of large datasets, powerful computing power, complex network architectures and optimization algorithms, the application of deep learning in the field of image semantic segmentation has achieved major breakthroughs [
Most traditional semantic segmentation methods are early methods that were first applied in the medical field, where scenes are simple and objects differ clearly from the background. The main lines of research are: segmentation methods based on threshold [
In the field of semantic segmentation based on deep learning, convolutional neural networks have become an important means of image processing, since they can fully utilize the semantic information of images to achieve semantic segmentation. To cope with increasingly complex image segmentation scenarios, a series of deep learning-based image semantic segmentation methods have been proposed to achieve more accurate and efficient segmentation and to further broaden the application scope of image segmentation. Image semantic segmentation based on region classification and image semantic segmentation based on pixel classification are the current mainstream deep learning-based semantic segmentation methods. The former divides the image into a series of target candidate regions and classifies each target region with a deep learning algorithm, which avoids the generation of superpixels and effectively improves the efficiency of image segmentation. The former is represented by MPA [
As one of the representatives in the field of semantic segmentation algorithms, UNet uses the “encoder-decoder” structure to perform feature fusion between feature maps, so that the shallow convolutions focus on texture features and the deep convolutions focus on essential image features. This paper selects the UNet semantic segmentation algorithm as the basic algorithm for research. In recent years, in the study of the UNet semantic segmentation network, Huang et al. [
The standard convolution is replaced with depthwise separable convolution, which reduces the number of parameters to be computed and separates the channel-wise and spatial (region) computations.
The attention mechanism is introduced, which enables the network to learn weight information along the feature channel dimension: the weights of feature channels that contribute to good network performance are increased, and the weights of feature channels that contribute poorly are suppressed, so that the training efficiency can be improved.
To improve the segmentation accuracy and generalization ability of the semantic segmentation algorithm, this paper operates on the gradient directly and smooths the gradient curve with a suitable gradient compression method. Gradient compression also regularizes the weight space and the output features, thereby improving the performance of the detection algorithm.
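The parameter saving from replacing standard convolution with depthwise separable convolution can be illustrated with a short count. This is a minimal sketch, not the layer configuration used in the paper: the layer sizes (64 input channels, 128 output channels, 3 x 3 kernel) are hypothetical, the function names are introduced here for illustration, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a depthwise k x k convolution (one filter per
    input channel) followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in      # spatial filtering, channel by channel
    pointwise = c_in * c_out      # channel mixing with 1 x 1 kernels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
dsc = depthwise_separable_params(c_in, c_out, k)
ratio = dsc / std  # algebraically equal to 1/c_out + 1/k**2
print(std, dsc, round(ratio, 4))
```

For a 3 x 3 kernel the ratio is dominated by the 1/k² term, so under these assumptions the separable layer needs roughly one ninth of the weights of the standard layer.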
Activation functions play an essential role in enabling neural network models to learn and represent complex, nonlinear input characteristics. In this paper, the widely used LeakyReLU function is used as the activation function.
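As a minimal sketch of how LeakyReLU differs from ReLU, the scalar functions below compare their outputs and gradients. The negative slope of 0.01 is an assumption (it is a commonly used default); the value used in the paper is not stated in this text.

```python
def relu(x):
    """Standard ReLU: negative inputs are mapped to 0."""
    return x if x > 0 else 0.0


def leaky_relu(x, negative_slope=0.01):
    """LeakyReLU: negative inputs keep a small, non-zero output.
    negative_slope=0.01 is an assumed value, not taken from the paper."""
    return x if x > 0 else negative_slope * x


def leaky_relu_grad(x, negative_slope=0.01):
    """Derivative of LeakyReLU; it is never exactly 0, so a neuron
    with a negative input still receives a gradient signal."""
    return 1.0 if x > 0 else negative_slope


for x in (-2.0, 3.0):
    print(x, relu(x), leaky_relu(x), leaky_relu_grad(x))
```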
Although the traditional ReLU activation function is fast to compute and converges quickly, when the input is negative its output is 0, so the neuron cannot update its parameters. As is shown in
The basic assumption of Depthwise Separable Convolution [
The relationship between the feature map channels is particularly important in image semantic segmentation, especially in autonomous driving. Therefore, this paper introduces the SE [
The main operations of the SE module are Squeeze and Excitation. The SE module first compresses the input feature map to obtain channel-level global features, then performs the excitation operation on these global features. This learns the relationship between the feature channels and yields a weight for each channel. Finally, the weights are multiplied with the input feature map to obtain the output feature map. The Squeeze operation can be expressed as follows:
In the formula,
The Excitation operation can be expressed as follows:
In the formula,
After the above operations, the output weight of the excitation operation is multiplied by the original input feature and the output of the SE module is:
In the formula:
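The three SE steps described in this section (squeeze, excitation, scaling) can be sketched in plain Python as follows. This is an illustrative sketch under assumptions: it follows the commonly used SE formulation (global average pooling, then a fully connected layer with ReLU, then a fully connected layer with sigmoid), biases are omitted, and the weight matrices `w1` and `w2` and the tiny input are hypothetical values rather than anything taken from the paper.

```python
import math


def se_block(feature_map, w1, w2):
    """feature_map: list of C channels, each an H x W list of lists.
    w1: (C/r) x C weights, w2: C x (C/r) weights (biases omitted)."""
    # Squeeze: global average pooling gives one value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
         for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    hidden = [max(0.0, sum(w * zc for w, zc in zip(row, z))) for row in w1]
    s = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
         for row in w2]
    # Scale: every channel is multiplied by its learned weight.
    out = [[[s_c * val for val in row] for row in ch]
           for ch, s_c in zip(feature_map, s)]
    return out, s


# Hypothetical 2-channel, 2 x 2 input with reduction ratio r = 2.
fmap = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 4.0]]]
w1 = [[0.5, 0.5]]        # 1 x 2
w2 = [[1.0], [-1.0]]     # 2 x 1
out, s = se_block(fmap, w1, w2)
print(s)
```

In a trained network the two weight matrices are learned, so the channel weights `s` reflect which channels the network found useful.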
Optimization techniques are of great significance for improving the performance of neural networks. Currently, the optimization methods used in the field of semantic segmentation algorithms mainly include BN (batch normalization), which operates on activations, and WS (weight standardization), which operates on weights [
Among the optimization algorithms that operate on gradients in the field of semantic segmentation algorithms, the most common methods are to calculate the momentum of the gradient. The main optimization algorithms are Stochastic Gradient Descent with Momentum (SGDM) [
The formula for Gradient Compression is as follows:
In the above formula,
Once the mean of the gradient matrix is obtained, the mean value is subtracted from each column vector of the gradient and the result is multiplied by the gradient smoothing coefficient, which gives the update direction of the optimal weight. The calculation is relatively simple and adds little computational cost when applied to the Adam optimization algorithm. Experiments show that it takes only about 0.5 s more per epoch when training the LeNet convolutional neural network on the MNIST handwritten digit recognition dataset.
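The operation described above can be sketched in a few lines. This is a minimal sketch under assumptions: the mean is taken per column of the gradient matrix (each column treated as one weight vector's gradient), the coefficient simply scales the centered gradient, and the names `compress_gradient` and `smooth` as well as the value 0.5 are hypothetical.

```python
def compress_gradient(grad, smooth=1.0):
    """grad: M x N gradient matrix given as a list of rows.
    Subtracts each column's mean and scales by `smooth`
    (a hypothetical name for the gradient smoothing coefficient)."""
    m, n = len(grad), len(grad[0])
    col_means = [sum(grad[i][j] for i in range(m)) / m for j in range(n)]
    return [[smooth * (grad[i][j] - col_means[j]) for j in range(n)]
            for i in range(m)]


grad = [[1.0, 4.0],
        [3.0, 0.0]]
compressed = compress_gradient(grad, smooth=0.5)
print(compressed)
```

After the operation every column of the gradient has zero mean, which is the constraint on the update direction that the method relies on.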
The above formula can be written in matrix form as follows:
In the above formula,
In
Input: Weight vector 


Training step:
for (loop range missing): steps 1 to 8 (step formulas missing from the extracted text)
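Since the step formulas of the algorithm are not available in this text, the following is only a sketch of how gradient compression might be embedded in an Adam update. It assumes that the compressed gradient replaces the raw gradient in the standard Adam moment estimates; the function name `adam_gc_step`, the parameter `smooth`, and the hyperparameter defaults (the commonly used Adam defaults) are assumptions, not the paper's settings.

```python
import math


def adam_gc_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                 eps=1e-8, smooth=1.0):
    """One Adam update on an M x N weight matrix, with the gradient
    column-centered and scaled before the moment updates (assumed
    placement of gradient compression)."""
    rows, cols = len(w), len(w[0])
    # Gradient compression: subtract column means, scale by `smooth`.
    col_means = [sum(grad[i][j] for i in range(rows)) / rows
                 for j in range(cols)]
    g = [[smooth * (grad[i][j] - col_means[j]) for j in range(cols)]
         for i in range(rows)]
    t += 1
    for i in range(rows):
        for j in range(cols):
            # First and second moment estimates.
            m[i][j] = beta1 * m[i][j] + (1 - beta1) * g[i][j]
            v[i][j] = beta2 * v[i][j] + (1 - beta2) * g[i][j] ** 2
            # Bias correction.
            m_hat = m[i][j] / (1 - beta1 ** t)
            v_hat = v[i][j] / (1 - beta2 ** t)
            # Weight update.
            w[i][j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v, t


w = [[0.0, 0.0], [0.0, 0.0]]
m = [[0.0, 0.0], [0.0, 0.0]]
v = [[0.0, 0.0], [0.0, 0.0]]
grad = [[1.0, 3.0], [3.0, 1.0]]
w, m, v, t = adam_gc_step(w, grad, m, v, t=0)
print(w)
```

In this toy example both columns have mean 2, so the compressed gradient is [[-1, 1], [1, -1]] and the first step moves each weight by about one learning rate in the opposite direction.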
To better verify the effectiveness of our algorithm, this paper uses the Cityscapes dataset, an urban street scene dataset that contains diverse stereo video sequences recorded in street scenes from 50 different cities. It provides high-quality pixel-level annotations for 5,000 frames, in addition to a larger set of 20,000 weakly annotated frames. The Cityscapes dataset has two sets of evaluation criteria: fine and coarse. The former provides 5,000 finely annotated images, and the latter provides 5,000 finely annotated images plus 20,000 coarsely annotated images.
The Cityscapes dataset is designed to: (1) evaluate the performance of vision algorithms on the main tasks of semantic urban scene understanding: pixel-level, instance-level and panoptic semantic labeling; (2) support research aimed at leveraging large amounts of (weakly) annotated data, for example for training deep neural networks.
MIoU (Mean Intersection over Union) is the intersection-over-union ratio averaged over all classes and is the current standard measure for semantic segmentation. It computes the ratio of the intersection to the union of two sets. In semantic segmentation, the two sets are the ground truth and the predicted segmentation. The formula is as follows:
In the formula,
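The metric can be sketched with a confusion matrix: for each class, the intersection is the number of pixels correctly assigned to that class, and the union is the number of pixels that are labeled or predicted as that class. This is a minimal sketch on flattened label lists; the function name `mean_iou` and the choice to skip classes that appear in neither the ground truth nor the prediction are assumptions.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over classes from flat lists of integer class labels."""
    # cm[t][p] counts pixels with ground truth t and prediction p.
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    ious = []
    for c in range(num_classes):
        tp = cm[c][c]
        gt_total = sum(cm[c])                                   # row sum
        pred_total = sum(cm[r][c] for r in range(num_classes))  # column sum
        union = gt_total + pred_total - tp
        if union > 0:  # skip classes absent from both sets (assumption)
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Toy example with 4 pixels and 2 classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(score)
```

In the toy example the per-class IoU values are 1/2 and 2/3, so the mean is 7/12.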
The algorithms in this paper are built with the PyTorch 1.2 framework. Training and testing are run on a Windows 10 machine with an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz, 16 GB of RAM, and an NVIDIA GeForce RTX 2070 SUPER GPU with 8 GB of video memory and 2560 CUDA cores.
The improvements described above are combined to verify their effectiveness, and the changes in Loss during training are recorded. At the same time, early stopping is used in PyTorch to prevent the model from overfitting, which would lead to poor generalization ability. The maximum number of training iterations is set to 1000, and the model weights are saved every 50 iterations. The Loss of the training process is shown in
In
In order to verify the effectiveness of the improved algorithms and training method in this paper, each experiment is performed on the Cityscapes training set, then each accuracy index test is performed on the validation set. The parameter settings are consistent with the overall accuracy test experiment. The visual comparison of some segmentation results is shown in
Comparing the segmentation results of the basic UNet model and the UDAGC model shows that the latter performs better overall in classification accuracy and localization accuracy, a significant improvement over the former. Categories such as pedestrians, trees, vehicles, and roads are segmented well and can generally be assigned to the correct class. The segmentation edges are also relatively smooth and accurate. However, because the maximum number of iterations is set relatively low, neither model segments small objects or object edges well in long-distance and complex scenes, which is also a common problem in the segmentation field.
Meanwhile, because several improvements are made to the basic UNet semantic segmentation network, it is necessary to verify the effectiveness of each part, so that its effect on the overall network performance can be observed quantitatively. The resulting data are shown in
UNet | DA | GC | MIoU/%
√ |   |   | 73.67
√ | √ |   | 76.68
√ | √ | √ | 78.02
Note: The best value of this experiment is 78.02 (last row, bold in the original table).
The MIoU of the UDAGC network reaches 78.02%, which in relative terms is 5.9% higher than the basic UNet and 1.7% higher than UDA (the UNet network after adding DA). It can be concluded that, because the GC optimization algorithm makes the training process more stable and effective, the ability of the trained model to learn image features is enhanced, and the improved UNet algorithm achieves better results.
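The reported gains match a relative (not absolute) comparison of the MIoU values, which can be checked with a short calculation. This sketch uses only the MIoU values reported in this section; the function name `relative_gain` is introduced here for illustration.

```python
def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


gain_vs_unet = relative_gain(78.02, 73.67)  # UDAGC vs. basic UNet
gain_vs_uda = relative_gain(78.02, 76.68)   # UDAGC vs. UDA
print(round(gain_vs_unet, 1), round(gain_vs_uda, 1))
```

The corresponding absolute differences are 4.35 and 1.34 percentage points of MIoU.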
To further verify the effectiveness of the improved algorithms and training method, UDAGC is compared with two other recent semantic segmentation networks, Deeplabv3 and SegNet. All experiments are carried out in the same experimental environment. Each network is trained on the Cityscapes training set, and each accuracy index is then tested on the validation set. The parameter settings are consistent with the overall accuracy test experiment. The visual comparison of some segmentation results and results data are in
Algorithm network | MIoU/%
Deeplabv3 | 74.53
SegNet | 75.25
UDAGC | 78.02
As is shown in
In this paper, depthwise separable convolution and attention mechanism are introduced on the basis of the basic network UNet, and a new training adjustment strategy of gradient compression is proposed at the same time. Through a series of experimental verifications, the following conclusions are obtained:
The improvement methods in this paper can meet the demand for a lightweight semantic segmentation network in the autonomous driving perception system, reduce the operation cost and improve the operation speed. It also provides support for the road condition analysis and realtime segmentation of the autonomous driving perception system.
The training optimization algorithm proposed in this paper not only improves the generalization ability and segmentation accuracy of the trained model but is also highly adaptable and can be easily added to other optimization algorithms.
Compared with the basic algorithm and the other latest semantic segmentation algorithms, the improved method in this paper has a considerable improvement in the segmentation accuracy of common road objects, especially the segmentation effect on the driving area, which is an important segmentation target in the autonomous driving system.
The dataset used in this paper has limited training data, all of it captured in daytime traffic with good visibility. The segmentation performance in other weather conditions or at night needs further research.
Low segmentation accuracy tends to occur in more complex driving scenes. To solve this problem, deeper research on the feature extraction network is needed.