Three-dimensional (3D) reconstruction based on structured light has been widely used in industrial measurement owing to its many advantages. To address the high mismatch rate and poor real-time performance caused by factors such as system jitter and noise, a lightweight stripe-image feature extraction algorithm based on the You Only Look Once v4 (YOLOv4) network is proposed. First, MobileNetV3 is used as the backbone network for efficient feature extraction; then the Mish activation function and the Complete Intersection over Union (CIoU) loss function are used to compute an improved bounding-box regression loss, which effectively improves the accuracy and real-time performance of feature detection. Simulation results show that the improved model is only 52 MB in size, the mean average precision (mAP) on the fringe-image data reaches 82.11%, and the 3D point cloud restoration rate reaches 90.1%. Compared with existing models, the proposed model has clear advantages and can satisfy the accuracy and real-time requirements of reconstruction tasks on resource-constrained devices.

Optical three-dimensional (3D) measurement technology [

Algorithms for obtaining depth information (or the unwrapped phase) from fringe images usually require two main steps: phase extraction, represented by phase-shifting and Fourier-transform methods [

Parameter compression of the constructed YOLOv4 model can resolve the contradiction between a huge network model and limited storage space. Widely used model-compression methods include weight parameter quantization [

Weight parameter quantization reduces resource consumption by lowering the precision of the weights. For example, in common development frameworks [

This paper uses the YOLOv4 network model to extract features from striped structured-light images. Considering that fringe-image features are not obvious under the influence of illumination and noise, the feature extraction network is improved. The algorithm first uses the MobileNetV3 structure to replace the Cross-Stage Partial Darknet53 (CSPDarknet53) backbone of YOLOv4, reducing the number of backbone parameters, and then introduces the Mish activation function and the CIoU loss function to compute an improved bounding-box regression loss, which effectively improves the generalization of feature extraction.

The principle of the fringe structured light 3D reconstruction algorithm is shown in

The phase shift method is one of the commonly used methods of the fringe structured light 3D reconstruction technology. By projecting a series of fringe images with a phase shift of

The wrapped phase is discontinuous, and its value range is between

Finally, the mapping expression between the unwrapped phase and the height is determined, and the mapping coefficients are calibrated to convert between the depth data and the phase data of the measured object, yielding the 3D topography of the object surface.
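The phase-extraction and unwrapping steps described above can be sketched for the four-step phase-shift case. This is a minimal NumPy illustration, not the paper's implementation; the synthetic phase map and the use of NumPy's 1-D unwrapping are assumptions for demonstration:

```python
import numpy as np

def wrapped_phase_four_step(I1, I2, I3, I4):
    """Recover the wrapped phase from four fringe images shifted by pi/2.

    For I_n = A + B*cos(phi + n*pi/2), n = 0..3:
    I4 - I2 = 2B*sin(phi), I1 - I3 = 2B*cos(phi),
    so phi = arctan2(I4 - I2, I1 - I3), wrapped to (-pi, pi].
    """
    return np.arctan2(I4 - I2, I1 - I3)

# Synthetic check: build four shifted fringe profiles from a known phase.
x = np.linspace(0, 4 * np.pi, 512)
phi_true = np.angle(np.exp(1j * x))  # wrapped ground truth
frames = [128 + 100 * np.cos(x + n * np.pi / 2) for n in range(4)]
phi = wrapped_phase_four_step(*frames)

# Continuous (unwrapped) phase, ready for the phase-to-height mapping.
phi_unwrapped = np.unwrap(phi)
```

On real images the same formula is applied pixel-wise to 2-D frames, and the unwrapping step typically uses a 2-D spatial or temporal unwrapping algorithm rather than a row-wise `np.unwrap`.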

YOLOv4 is mainly composed of Backbone, Neck and Head, as shown in

The YOLOv4 network model is improved in two respects: the MobileNetV3 structure replaces the backbone feature extraction network of YOLOv4, greatly reducing the number of backbone parameters through the depthwise separable convolutions in MobileNetV3; and the Mish activation function and the CIoU loss function are introduced to compute an improved bounding-box regression loss, effectively improving the generalization of feature extraction.

The YOLOv4 algorithm uses the CSPDarknet53 network for feature extraction, which contains five residual blocks stacked from 1, 2, 8, 8, and 4 residual units, respectively. The network has 104 layers in total, including 72 convolutional layers, and uses a large number of standard 3 × 3 convolutions, so the computation consumes substantial resources and real-time performance is difficult to achieve. Moreover, as features pass through many layers, the additional convolutional layers gradually weaken the extraction of locally refined features, which degrades the detection of small features. Therefore, the YOLOv4 feature extraction network must be improved to meet small-target detection and real-time requirements.

The MobileNet family uses depthwise separable convolution, which factorizes a traditional convolution into a depthwise convolution and a 1 × 1 pointwise convolution, and introduces a width multiplier and a resolution multiplier to control the number of model parameters. MobileNetV3 is the third generation of the MobileNet series: it combines the depthwise separable convolution of MobileNetV1 with the inverted residuals and linear bottleneck of MobileNetV2, plus a Squeeze-and-Excitation (SE) attention mechanism. MobileNetV3 uses neural architecture search (NAS) to determine the network configuration and parameters, and replaces the swish activation function with the cheaper h-swish, achieving less computation with higher accuracy. In a depthwise separable block, a 3 × 3 kernel is first convolved with each channel of the input feature map separately, producing a feature map whose channel count equals the input's, and then N 1 × 1 kernels are convolved with this feature map to obtain a new N-channel feature map. Compared with the CSPDarknet53 network, this backbone retains relatively powerful feature extraction while greatly reducing the model size, making it easier to deploy on mobile terminals in the industrial field. It is also shallower than CSPDarknet53, so it extracts locally refined features better and improves the detection of small targets.
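The parameter saving of the depthwise separable factorization described above can be checked with simple counting; the 256-channel example below is an illustrative assumption, not a layer from the paper's network:

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    """Depthwise k x k conv (one kernel per input channel)
    followed by a 1 x 1 pointwise conv to c_out channels."""
    return k * k * c_in + c_in * c_out

# Example: a 3 x 3 convolution mapping 256 channels to 256 channels.
standard = conv_params(3, 256, 256)           # 589824 weights
separable = dw_separable_params(3, 256, 256)  # 67840 weights
ratio = separable / standard                  # equals 1/9 + 1/256, ~0.115
```

The ratio `1/k^2 + 1/c_out` shows why replacing the many standard 3 × 3 convolutions of CSPDarknet53 with separable ones shrinks the backbone by nearly an order of magnitude.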

The model is trained with the self-regularized, non-monotonic Mish activation function, which ensures effective backpropagation of the training loss and obtains better generalization ability and higher accuracy while maintaining the convergence speed. The calculation formula is:
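For reference, the Mish function is Mish(x) = x · tanh(softplus(x)) = x · tanh(ln(1 + e^x)); a minimal scalar sketch (a deep-learning framework would apply it element-wise to tensors):

```python
import math

def mish(x):
    """Mish(x) = x * tanh(ln(1 + e^x)): smooth, non-monotonic,
    unbounded above and bounded below (minimum around -0.31)."""
    return x * math.tanh(math.log1p(math.exp(x)))
```

Unlike ReLU, Mish passes small negative values through with a small negative output, which helps gradients flow for negative pre-activations.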

To detect targets more accurately, the training loss is composed of the weighted sum of the bounding-box regression loss, the confidence loss, and the classification loss, from which the backpropagated gradient is computed. The calculation formula is:
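The CIoU part of the regression loss can be sketched as follows, following the standard CIoU definition (1 − IoU plus a normalized center-distance term and an aspect-ratio term); the corner box format `(x1, y1, x2, y2)` and the small stabilizing epsilon are illustrative assumptions:

```python
import math

def ciou_loss(box_a, box_b):
    """CIoU loss between two boxes given as (x1, y1, x2, y2):
    L = 1 - IoU + rho^2 / c^2 + alpha * v, where rho is the distance
    between box centers, c is the diagonal of the smallest enclosing
    box, and v penalizes aspect-ratio mismatch."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter)

    # Normalized squared distance between box centers
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    cw = max(ax2, bx2) - min(ax1, bx1)   # enclosing box width
    ch = max(ay2, by2) - min(ay1, by1)   # enclosing box height
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - iou + v + 1e-9)

    return 1 - iou + rho2 / c2 + alpha * v
```

Compared with plain IoU loss, the extra terms keep a useful gradient even when the predicted and ground-truth boxes do not overlap, which speeds up box regression convergence.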

By changing

To verify the reliability of the algorithm and its effect in actual measurement, a grating 3D projection measurement system consisting of a projector and a camera was built, as shown in the

The experimental steps are as follows:

1. Generate sinusoidal grating fringes; a four-step phase-shift fringe pattern is used here.

2. Project the sinusoidal grating fringe pattern onto the uniform whiteboard, and capture the grating fringes modulated by the surface of the object.

3. Train the YOLOv4 network model on the training data to obtain the mapping between the fringe image and the depth image.

4. Use the trained network to obtain the depth data of the fringe image.
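Step 1 above, generating the four-step phase-shifted sinusoidal fringes, can be sketched as below. The resolution, fringe period, background intensity A, and modulation B are illustrative assumptions, not the paper's projector settings:

```python
import numpy as np

def phase_shift_fringes(width=640, height=480, period=32, steps=4):
    """Generate `steps` vertical sinusoidal fringe patterns with a
    2*pi/steps phase shift between consecutive patterns:
    I_n(x, y) = A + B * cos(2*pi*x / period + 2*pi*n / steps)."""
    A, B = 128.0, 100.0  # assumed background intensity and modulation
    x = np.arange(width)
    patterns = []
    for n in range(steps):
        row = A + B * np.cos(2 * np.pi * x / period + 2 * np.pi * n / steps)
        patterns.append(np.tile(row, (height, 1)))
    return patterns

patterns = phase_shift_fringes()
```

Averaging the four patterns cancels the cosine terms and recovers the background intensity A, a quick sanity check on the phase shifts.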

For the deep-learning network training, the number of training epochs is uniformly set to 100, the batch size to 16, the initial learning rate to 1e-3, and the initial weights are all set to 1. The training set contains 5012 photos in total; in each epoch, 90% of the photos are used for training and the remaining 10% for real-time monitoring of the training effect. The experiment selects the weight file with the lowest loss in each round and compares mAP, model size, and real-time detection frames per second (FPS).

As shown in

Model | mAP | Model size | FPS |
---|---|---|---|
YOLOv4 | 83.64% | 220 MB | 6.33 |
YOLOv4 + MobileNetV3 | 77.48% | 50 MB | 14.35 |
Improved algorithm in this paper | 82.11% | 52 MB | 13.67 |

Based on the built experimental system and the trained deep-learning models, 3D reconstruction is performed on an object with a simple shape and an object with a complex shape. The experiment uses a high-speed visual processor for training, and uses pre-trained weights to train the original YOLOv4 network and the improved YOLOv4 model of this paper. Finally, the results of the three models are compared.

Model | Average phase error | Point cloud restoration rate | Operation time |
---|---|---|---|
YOLOv4 | 0.12 | 73.4% | 7.23 |
YOLOv4 + MobileNetV3 | 0.057 | 80.4% | 4.32 |
Improved algorithm in this paper | 0.034 | 90.1% | 3.64 |

Model | Average phase error | Point cloud restoration rate | Operation time |
---|---|---|---|
YOLOv4 | 0.054 | 81.2% | 8.21 |
YOLOv4 + MobileNetV3 | 0.034 | 84.4% | 5.01 |
Improved algorithm in this paper | 0.012 | 92.3% | 3.13 |

Simulating the 3D reconstruction of two different objects shows that, compared with the simple object of the first example, the object of the second example is more complex and has richer fringe features, making the phase change easier to obtain, so its reconstruction accuracy and speed are better than in the first example. The simulation results of the three algorithms also show that the lightweight YOLOv4 model of this paper outperforms the other two models in average phase error, point cloud restoration rate, and running time, although detail reconstruction at the sub-pixel level still requires further research.

Based on the 3D reconstruction model of striped structured light, this paper proposes a stripe-image feature extraction algorithm based on a lightweight YOLOv4. The model uses the lightweight MobileNet network to replace the CSPDarknet backbone of YOLOv4, which simplifies the network structure and improves the real-time performance of detection, and uses the Mish activation function and the CIoU loss function to compute an improved bounding-box regression loss, which effectively improves feature-detection accuracy and real-time performance. The experimental results show that, compared with existing 3D reconstruction methods, the depth information calculated by the proposed method is more accurate, improving the accuracy of 3D measurement results for fringe images. It can therefore be used effectively in fringe-projection 3D measurement and better meets the needs of 3D shape measurement of objects in scientific research and practical applications. Future work will study the effectiveness of the proposed method in more experimental scenarios, such as the effectiveness and accuracy of fringe-image depth estimation for colored objects, highly reflective objects, and out-of-focus projection. In addition, the generalization ability of the model, a common problem in deep learning, is a key issue that must be addressed in improving the proposed method.

The authors thank Dr. Jinxing Niu for his suggestions. The authors thank the anonymous reviewers and the editor for the instructive suggestions that significantly improved the quality of this paper.