Single-view 3D reconstruction aims to recover the complete 3D shape of an object from one viewpoint. Existing methods struggle to reconstruct the mesh surface of complex objects for three reasons. First, the mesh representation is not easily integrated into deep learning frameworks, so surface details are difficult to predict and the visual quality of the reconstruction is poor. Second, the 3D topology is constrained by predefined templates and is inflexible, so reconstructing complex topology generates unnecessary mesh self-intersections and connections that destroy surface details. Third, training of the reconstruction network is slowed by the large amount of information attached to the mesh vertices, so training takes too long. In this paper, we propose a method for fast mesh reconstruction from a single view based on Graph Convolutional Network (GCN) and topology modification. We use GCN to ensure the generation of high-quality mesh surfaces and topology modification to improve the flexibility of the topology. In addition, a feature fusion method is proposed to make full use of the image features of each stage hierarchically. We train our network on the open 3D dataset ShapeNet and add a new weight parameter to speed up the training process. Extensive experiments demonstrate that our method can reconstruct meshes of objects with complex topological surfaces and achieves better qualitative and quantitative results.

Image-based 3D reconstruction is the process of recovering 3D information from 2D images, with the aim of obtaining a 3D model that matches the 2D image. Single-view reconstruction methods have the advantage of requiring less input data: an image of the object from a single viewpoint usually suffices to restore its whole shape. With the emergence of large-scale 3D data sets such as ShapeNet [

Early single-view 3D reconstruction based on learning represented 3D structures as voxels [

To solve the above challenges, we propose a method for fast mesh reconstruction from single view based on Graph Convolutional Network (GCN) and Topology Modification. First, we use GCN with residual connection, namely G-ResNet [

The contributions of this paper are mainly as follows:

An end-to-end learning framework is proposed, which combines GCN and topology modification for the first time. It is used to reconstruct the 3D mesh model from a single image, considering the details of the reconstructed mesh surface and the flexibility of the mesh topology, and it can be applied to the reconstruction of complex structures with good generalization ability.

Unlike previous methods that simply vectorize the image, we propose a new feature fusion method that reuses the image features of different stages to meet the input requirements of different modules and make the modules compatible with each other. This module can be integrated into other learning frameworks.

A weight parameter for the 3D loss function is proposed, which prioritizes the location of key points during the training process and thereby speeds up training. We use it to optimize our training procedure and improve the stability of the network.

Human beings are good at using prior knowledge to make inferences and predictions [

According to different representations, the single-view-based 3D reconstruction technology is mainly divided into three directions: voxel, point cloud and mesh. Voxels discretize an object into a 3D voxel grid. Its advantage is that it is easy to integrate deep learning frameworks (such as 3D convolution and max pooling). Choy et al. proposed 3D-R2N2 based on the voxel representation and used the 3DLSTM network structure to establish a mapping from 2D graphics to 3D voxel models, which completed single-view or multi-view 3D reconstruction based on voxels [

In comparison, the point cloud is a simple, unified and easy-to-learn structure. Since the connectivity between vertices does not need to be updated, the point cloud is easier to manipulate during geometric transformation and deformation. Point Set Generation Network (PSGN) proposed by Fan et al. solves the problem of loss when training a point cloud network [

Polygon mesh is composed of vertices and triangular faces. It has the characteristics of scalability and curved surface, as well as light weight and rich shape details. The most important parts of mesh are the connections between adjacent points. N3MR [

In reality, many important data are stored in the form of graphs, such as social network information, knowledge graphs, protein networks, the World Wide Web, and so on. Since it is difficult for CNN to choose a fixed convolution kernel to adapt to the irregularities of the entire graph [

A 3D mesh can also be regarded as a graph topology, with corresponding connections between the vertices and edges of the mesh. Wang et al. were the first to apply GCN to mesh-based 3D reconstruction from a single image [

However, the design of the corresponding convolution parameters for the mesh is highly uncertain, and the mesh representation is not suitable for conventional 3D convolution operations. Therefore, in this work, we use the polygon mesh as the 3D format and introduce a graph convolutional neural network (GCN) to control this structure and resolve the incompatibility between the mesh representation and neural networks.
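As a rough illustration of why a graph convolution fits the mesh structure, the sketch below updates each vertex feature from its own feature and the sum of its neighbours' features along mesh edges. It is a simplified, assumed form: the scalar weights `w_self` and `w_neigh` stand in for the learned weight matrices of a real GCN layer, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Vector = List[float]


def edges_to_neighbors(edges: Iterable[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Build an undirected adjacency list from mesh edges."""
    neighbors: Dict[int, List[int]] = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    return neighbors


def graph_conv(features: List[Vector], neighbors: Dict[int, List[int]],
               w_self: float, w_neigh: float) -> List[Vector]:
    """One simplified graph-convolution step on vertex features.

    f'_p = w_self * f_p + w_neigh * sum of f_q over neighbours q of p.
    Scalar weights are an assumption made to keep the sketch small.
    """
    updated = []
    for p, f_p in enumerate(features):
        aggregated = [0.0] * len(f_p)
        for q in neighbors.get(p, []):
            for k, value in enumerate(features[q]):
                aggregated[k] += value
        updated.append([w_self * own + w_neigh * agg
                        for own, agg in zip(f_p, aggregated)])
    return updated


# A single triangle: every vertex is connected to the other two.
triangle = edges_to_neighbors([(0, 1), (1, 2), (0, 2)])
result = graph_conv([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], triangle, 1.0, 0.5)
```

Because the update only follows existing edges, the same layer applies to meshes with any number of vertices and irregular connectivity, which a fixed-size convolution kernel cannot do.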

The graph topology is controlled by the connection of vertices and edges, so a structure containing a large number of vertices like a 3D mesh is not easy to update, and it consumes more resources when performing convolution operations. Topology modification, as a technology to update the topology in real time, has been used in the design of broadband antenna structures [

Pan et al. applied topology modification to the 3D reconstruction of the mesh, which solved the problem that the deformation of the mesh is limited by the predefined shape template and can adapt to the reconstruction of the surface of complex objects [

As an end-to-end network structure, when we are given an input image, the system outputs a 3D mesh model. An overview of the framework of this article is shown in

The encoding layer is used to extract 2D image features hierarchically, convert the input image into feature maps and feature vectors, and input them to the deform block and the topology modification module, respectively. The deform block modifies the predefined sphere mesh. By manipulating the feature vector attached to the vertices, the vertices can be deformed, so that the sphere mesh gradually tends to the object described by the input image. The topology modification module dynamically trims the mesh surface after each deformation, so that the mesh topology is no longer limited to a predefined template. After each deformation of the vertices and modification of the topology, we use a boundary optimization loss function to trim the zigzag boundary and smooth the model surface. In order to make the network produce stable deformation and generate accurate meshes, we combine the commonly used 3D loss function with boundary refinement loss to train our network.

Next, we will introduce the encoding layer, deform block, topology modification module and the 3D loss function used one by one.

The encoding layer uses VGG-19 as the main architecture of the network. First, input the image and extract it into feature maps of different layers and 1000-dimensional feature vectors, as shown in

The image encoding is computed once and reused. On the one hand, the feature maps of three layers, conv3_4, conv4_4 and conv5_4, are concatenated. Given any vertex in the 3D mesh, its projection onto the input image is found from the camera parameters, and bilinear interpolation over the four feature-map pixels adjacent to the projected point yields the feature vector that drives the deformation of that vertex in the mesh deformation module. On the other hand, the 1024-dimensional vector produced by the whole VGG-19 network is an important reference that guides the topology modification. In this way, a single feature extraction pass over the image satisfies the input requirements of both the mesh deformation and topology modification modules.
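The per-vertex feature pooling described above can be sketched as follows. This is a minimal illustration, assuming the vertex has already been projected to image coordinates `(x_img, y_img)`; the camera projection is omitted. The names `bilinear_sample` and `pool_vertex_feature` and the coordinate scaling convention are assumptions, not the paper's implementation.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [row][col][channel]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> List[float]:
    """Interpolate the channel vector at continuous position (x=col, y=row)."""
    height, width = len(fmap), len(fmap[0])
    # Clamp so that projections falling outside the map use the border value.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    return [
        (1 - dx) * (1 - dy) * fmap[y0][x0][c]
        + dx * (1 - dy) * fmap[y0][x1][c]
        + (1 - dx) * dy * fmap[y1][x0][c]
        + dx * dy * fmap[y1][x1][c]
        for c in range(len(fmap[0][0]))
    ]


def pool_vertex_feature(fmaps: List[FeatureMap], x_img: float, y_img: float,
                        image_size: int) -> List[float]:
    """Sample every feature map at the projected point and concatenate."""
    pooled: List[float] = []
    for fmap in fmaps:
        # Assumed convention: image corners map to feature-map corners.
        scale_x = (len(fmap[0]) - 1) / (image_size - 1)
        scale_y = (len(fmap) - 1) / (image_size - 1)
        pooled.extend(bilinear_sample(fmap, x_img * scale_x, y_img * scale_y))
    return pooled


# Toy maps standing in for feature maps of different resolutions.
coarse = [[[0.0], [1.0]], [[2.0], [3.0]]]
single = [[[7.0]]]
center = bilinear_sample(coarse, 0.5, 0.5)
fused = pool_vertex_feature([coarse, single], 1.0, 1.0, image_size=3)
```

Concatenating the samples from several layers gives each vertex both fine and coarse image evidence in one vector.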

Define the mesh structure as

However, a mesh predicted by G-ResNet alone is prone to obvious self-intersections, so the topology must be trimmed to achieve a suitable visual result.

In order to reduce the calculation of the deformation process and generate a more realistic 3D model, a topology modification process is added after each deform block to dynamically modify the topological relationship between the vertices and the surface of the mesh. A topology correction network is used to update the topological structure of the reconstructed grid by trimming the surfaces that clearly deviate from the ground truth.

We randomly sample points on the surface of the predicted mesh M and concatenate the replicated shape feature vector with the matrix containing all sampled points. A Multilayer Perceptron (MLP) takes the concatenated feature matrix as input and predicts the error distance from each sampled point to the ground truth. The final error of each triangular face is the average of the predicted errors of all points sampled on that face.
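The last step, turning per-point predictions into per-face errors, can be sketched as below. The per-point errors are taken as given (the MLP is not modelled), and the function name and the choice to assign 0.0 to faces without samples are assumptions made for illustration.

```python
from typing import List


def face_errors(sample_face_ids: List[int], sample_errors: List[float],
                num_faces: int) -> List[float]:
    """Average the predicted errors of the points sampled on each face.

    sample_face_ids[i] is the index of the face that sample i lies on.
    Faces that received no sample get 0.0 (an assumption), so they are
    never trimmed by a later threshold test.
    """
    if len(sample_face_ids) != len(sample_errors):
        raise ValueError("one error value is required per sampled point")
    sums = [0.0] * num_faces
    counts = [0] * num_faces
    for face_id, error in zip(sample_face_ids, sample_errors):
        sums[face_id] += error
        counts[face_id] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Two samples on face 0, one on face 1, none on face 2.
errors = face_errors([0, 0, 1], [0.25, 0.75, 0.5], num_faces=3)
```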

We apply a threshold strategy that deletes the faces whose errors exceed a predefined threshold τ, thereby updating the mesh topology. The threshold must be tuned to obtain the most suitable pruned mesh structure. If τ is too high, too few faces are trimmed and the reconstruction error increases; if τ is too low, too many triangular faces are deleted and the topological structure of the mesh is destroyed. Therefore, a coarse-to-fine strategy is adopted: a higher τ is used in the first module, and τ is then reduced step by step in the subsequent modules to gradually refine the area to be trimmed.
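The threshold strategy and its coarse-to-fine schedule can be sketched as follows. The error predictor is passed in as a callable because the real errors come from the trained network; the τ values and the helper names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Face = Tuple[int, int, int]


def prune_faces(faces: List[Face], errors: Sequence[float], tau: float) -> List[Face]:
    """Keep only the faces whose error does not exceed the threshold tau."""
    return [face for face, error in zip(faces, errors) if error <= tau]


def coarse_to_fine_prune(faces: List[Face],
                         predict_errors: Callable[[List[Face]], List[float]],
                         taus: Sequence[float]) -> List[Face]:
    """Apply pruning once per stage with a non-increasing threshold schedule."""
    if any(later > earlier for earlier, later in zip(taus, taus[1:])):
        raise ValueError("thresholds must not increase from stage to stage")
    for tau in taus:
        # Errors are re-estimated on the faces that survived the last stage.
        faces = prune_faces(faces, predict_errors(faces), tau)
    return faces


# Toy stand-in for the network: a fixed error per face.
toy_errors = {(0, 1, 2): 0.1, (0, 2, 3): 0.3, (0, 3, 1): 0.6}
mesh_faces = list(toy_errors)
kept = coarse_to_fine_prune(mesh_faces,
                            lambda fs: [toy_errors[f] for f in fs],
                            taus=(0.5, 0.2))
```

With the hypothetical schedule (0.5, 0.2), the first stage removes only the clearly wrong face and the second stage refines the result, which mirrors the trade-off described above.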

Since the parameters of the network have not been trained at the initial stage, the 3D model after a round of mesh deformation and topology modification cannot achieve sufficient accuracy, so we use the corresponding 3D loss function to repeat the process many times until the generated model error is within the expected range.

In this paper, the network is trained by 3D ground truth to constrain the deformation results of the mesh. The loss function is mainly based on Chamfer Distance

Chamfer loss. Chamfer Distance, the most common constraint function in 3D reconstruction, was originally used on point sets to measure the difference between the predicted vertices and the ground truth. Its main role is to constrain the vertex positions so that they gradually approach the ground truth. A large loss means the two sets of vertices differ greatly; a small loss means the reconstruction is better. The Chamfer loss can be defined as:
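For reference, the widely used symmetric Chamfer distance sums, for every point in one set, the squared distance to its nearest neighbour in the other set, and does so in both directions. The sketch below assumes that unnormalised form; the normalisation and weighting used in this paper may differ.

```python
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_distance(pred: List[Point], gt: List[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point sets."""
    if not pred or not gt:
        raise ValueError("both point sets must be non-empty")

    def one_way(src: List[Point], dst: List[Point]) -> float:
        # For each source point, squared distance to its nearest target point.
        return sum(min(squared_distance(s, d) for d in dst) for s in src)

    return one_way(pred, gt) + one_way(gt, pred)


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(predicted, ground_truth)
```

This brute-force version costs time proportional to the product of the two set sizes, which is fine for illustration but not for training on large point sets.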

Earth Mover’s loss. Earth Mover’s Distance is defined as the minimum sum of the distances between a point in one set and a point in another set on all possible corresponding arrangements. Earth Mover’s loss can be defined as:
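To make the definition concrete, the sketch below computes the Earth Mover's Distance between two equally sized point sets by trying every one-to-one assignment and keeping the smallest total distance. This brute-force search is an illustrative assumption that only scales to a handful of points; practical implementations rely on approximate or optimised assignment solvers.

```python
import itertools
import math
from typing import List, Sequence

Point = Sequence[float]


def earth_movers_distance(pred: List[Point], gt: List[Point]) -> float:
    """Minimum total Euclidean distance over all one-to-one assignments."""
    if len(pred) != len(gt):
        raise ValueError("this sketch assumes point sets of equal size")
    if not pred:
        return 0.0
    best = math.inf
    for assignment in itertools.permutations(range(len(gt))):
        cost = sum(math.dist(p, gt[j]) for p, j in zip(pred, assignment))
        best = min(best, cost)
    return best


set_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 1.0), (2.0, 0.0, 0.0)]
emd = earth_movers_distance(set_a, set_b)
```

Unlike the Chamfer distance, each ground-truth point is matched exactly once, so clustering many predicted points around a single target is penalised.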

Through Chamfer loss and Earth Mover’s loss, the vertices can be gradually returned to the appropriate position, but it is not enough to produce a well-structured and stable mesh. Inspired by the work of Pixel2Mesh [

Here,

Therefore, the Chamfer loss and Earth Mover’s loss can be further defined as:

Boundary regularizer. The topological trimming of the mesh model leaves jagged edges, which greatly degrade the visual appearance of the reconstructed mesh. To further improve the visual quality of the reconstructed mesh, we add a boundary regularization term to the original loss and penalize zigzags by forcing the boundary curve to remain smooth and consistent:
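One way such a term can be written, following the form used in TMNet and offered here only as an assumption about this paper's definition, is to sum over every boundary point the norm of the sum of unit vectors pointing from its boundary neighbours to it. On a straight boundary the two unit vectors cancel, while a sharp zigzag leaves a large residual. The function and argument names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def boundary_regularizer(points: List[Point],
                         boundary_neighbors: Dict[int, List[int]]) -> float:
    """Sum over boundary points x of || sum_p (x - p) / ||x - p|| ||.

    boundary_neighbors maps the index of a boundary point to the indices
    of its neighbouring points along the boundary curve.
    """
    total = 0.0
    for index, neighbor_ids in boundary_neighbors.items():
        x = points[index]
        residual = [0.0, 0.0, 0.0]
        for neighbor_id in neighbor_ids:
            p = points[neighbor_id]
            length = math.dist(x, p)
            if length == 0.0:
                continue  # skip coincident points to avoid division by zero
            for k in range(3):
                residual[k] += (x[k] - p[k]) / length
        total += math.sqrt(sum(r * r for r in residual))
    return total


straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
corner = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
flat_penalty = boundary_regularizer(straight, {1: [0, 2]})
corner_penalty = boundary_regularizer(corner, {1: [0, 2]})
```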

Here,

Therefore, the final training goal of the model can be defined as:

Here,


Next, we will introduce our experimental setup and details.

We train on the ShapeNet dataset, which contains 13 different object categories and the corresponding 50,000 model images. We divide the dataset into a training set and a testing set. By tracking the loss of our method and of all baselines on the testing set, we determine when to stop training.

Following the standard evaluation protocol for 3D shape reconstruction, we use two numerical metrics to evaluate the performance of the model and compare it with existing advanced methods. The Chamfer Distance (CD) and Earth Mover’s Distance (EMD) can be used in both training and testing. They measure the vertex error between the predicted meshes and the ground truth. Smaller values of both metrics indicate better results.

We compare the proposed method with several existing 3D reconstruction techniques. Specifically, we compare with Deep Marching Cubes and PSGN, which are among the more influential methods in volumetric reconstruction and point cloud reconstruction, respectively. In addition, we compare with Pixel2Mesh and TMNet in mesh reconstruction.

The input image size is set to 224 × 224. First, we pre-train the network structure shown in

As shown in

We uniformly sample 1000 points on the surface of the generated model, and measure CD and EMD between them and the real point cloud of ground truth. Since PSG only generates the point cloud of the target, the ball-pivoting algorithm [

Chamfer Distance (CD↓) by category:

Category | PSG | Deep Marching Cubes | Pixel2Mesh | TMNet | Ours
---|---|---|---|---|---
Chair | 6.647 | 5.415 | 4.932 | 4.850 |
Airplane | 2.353 | 4.400 | 1.570 | 1.370 |
Lamp | 2.740 | 3.292 | 2.828 | 3.295 |
Table | 7.065 | 5.383 | 4.271 | 3.679 |
Firearm | 2.186 | 4.907 | 1.790 | 1.754 |
Mean | 4.198 | 4.679 | 3.078 | 2.836 |

Earth Mover’s Distance (EMD↓) by category:

Category | PSG | Deep Marching Cubes | Pixel2Mesh | TMNet | Ours
---|---|---|---|---|---
Chair | 13.809 | 13.266 | 12.106 | 11.256 |
Airplane | 9.122 | 10.601 | 7.953 | 8.012 |
Lamp | 12.174 | 11.630 | 10.457 | 8.637 |
Table | 14.804 | 12.712 | 11.707 | 9.334 |
Firearm | 7.696 | 9.412 | 7.590 | 7.769 |
Mean | 11.521 | 11.524 | 9.962 | 8.958 |

Now we conduct an ablation experiment to analyze the importance of each component in the entire model.

Configuration | CD↓ | EMD↓
---|---|---
-Deform blocks (both) | N/A | N/A
-Topology modification (both) | 5.071 | 12.698
-Deform block (Subnet-2) | 6.249 | 15.463
-Topology modification (Subnet-2) | 4.619 | 10.725
-Boundary refinement | 4.087 | 11.311
Full model | 4.212 | 10.224

We first remove the deform blocks in the two subnets and directly perform topology modification and boundary refinement on the initial 3D sphere. The undeformed sphere lacks GCN’s control over the topology, so a large number of faces are predicted as erroneous. As a result, the topology modification trims most of the faces and destroys the original mesh topology, leaving only a few remaining mesh fragments. Since the training result contains only a few vertices and mesh faces, we cannot perform sampling point analysis on them, as shown in

Second, we remove the topology modification modules in the two subnets and re-train the network. The generated model has a recognizable 3D shape, but there are self-intersecting connections between the erroneous faces and the mesh. In particular, unnecessary connections exist in thinner parts such as the chair legs and armrests. The reason is that, without the error prediction and face trimming provided by topology modification, the network only preserves the basic pose of the reconstructed object; at the same time, GCN alone cannot break the constraints of the spherical topology to form such a “hollow” surface.

After clarifying the indispensability of these two modules to the model, we also conduct ablation experiments and analysis on the number of modules. After training Subnet-1, we remove the deform blocks in both subnets and topology modification modules in Subnet-2. As shown in the detailed results in

Finally, we find that the discontinuous surface with more complex structure (such as the office chair in

Based on GCN and topology modification, we propose an improved end-to-end network architecture that can quickly generate 3D mesh models with complex topologies from a single view. The iterative use of GCN and topology modification resolves the difficulty of achieving high-quality surface reconstruction and a highly flexible topology at the same time. The proposed feature fusion method uses hierarchical input to make full use of the image features at each stage and resolves the input incompatibility between modules. In addition, the proposed weight parameter helps the network attend to key positions during training and reduces the training cost. Extensive experiments and measurements show that our method reconstructs common categories well, especially categories with complex topological structures.

For future work, we will test our algorithm on other 3D data sets, such as Pix3D with pixel-level 2D-3D correspondence and Pascal3D + [

Thanks to the supervisor for writing guidance and other colleagues in the laboratory for their help.