To obtain intuitive pedestrian detection results and avoid the impact of motion-pose uncertainty on real-time detection, a pedestrian detection system based on a convolutional neural network was designed. A Dynamic Selection of Optional Feature (DSOF) module and a center branch are proposed in this paper, and targets are detected by an anchor-free method. Although almost all state-of-the-art detectors use pre-defined anchor boxes to enumerate the possible positions, scales, and aspect ratios of targets, their effectiveness and generalization ability are also limited by those anchor boxes. Most anchor-based detectors rely on heuristically designed anchor boxes, which makes it difficult to detect objects of different types and sizes, especially objects with highly overlapping boundaries. To solve this problem, the DSOF module is proposed, which selects the most appropriate feature layer for each instance through automatic feature selection. With multi-level prediction, many low-quality predicted bounding boxes are generated far from the target center. To suppress these low-quality detections, we introduce a new center branch that predicts the deviation of a pixel from its corresponding bounding box. This score is used to down-weight low-quality bounding boxes before the detections are merged by Non-Maximum Suppression (NMS).

Recognition and localization are the two primary tasks in object detection. For an arbitrary image, an object detector must judge whether semantic object instances from predefined classes exist; if they do, the spatial position and extent of each instance are returned. To add localization, sliding window approaches [

However, anchor boxes have several disadvantages. First, they introduce additional hyperparameters [

The object detection [

Single-stage methods remove the RoI extraction step and directly classify and regress the candidate anchor boxes. Compared with other methods, fewer anchor boxes are regressed and classified in You Only Look Once (YOLO) [

Feature pyramids and multi-level feature pyramids are common object detection structures. The method of predicting class scores and bounding boxes from multiple feature scales was first proposed in SSD.

The Intersection over Union (IOU) loss function is proposed in UnitBox [

In this section, target detection is reformulated as per-pixel prediction. Multi-level prediction with a feature pyramid can effectively raise the recall rate and resolve the ambiguity caused by overlapping bounding boxes, but it also produces many low-quality predicted bounding boxes. In the past, feature pyramids were generally used with heuristic feature assignment, which usually cannot select the optimal feature layer. Our method both reduces the number of low-quality predicted bounding boxes and selects the optimal layer of the feature pyramid.

In the FCOS network, each location (x, y) on the feature map F_i is mapped back onto the input image as (xs + s/2, ys + s/2), where s is the total stride of F_i. If the location (x, y) falls into any ground-truth box, it is regarded as a positive sample and assigned the class label c_i of the object in that bounding box. C is the number of classes, which is 20 for the VOC dataset. Taking the center point (c_x, c_y) of a bounding box of width w and height h, an inner ellipse E_1 is defined with semi-axes proportional to the box size (shrink factor σ_1):

E_1: (x − c_x)^2 / (σ_1 w/2)^2 + (y − c_y)^2 / (σ_1 h/2)^2 = 1

Taking the same center point, a larger outer ellipse E_2 is defined with factor σ_2 > σ_1:

E_2: (x − c_x)^2 / (σ_2 w/2)^2 + (y − c_y)^2 / (σ_2 h/2)^2 = 1

If the position (x, y) falls within ellipse E_1, it is marked as a positive sample; if it falls outside ellipse E_2, it is marked as a negative sample; and if it falls between E_1 and E_2, it is ignored during training.
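As a concrete illustration, this ellipse-based label assignment can be sketched in NumPy. The shrink factors `sigma1` and `sigma2` and the function name are illustrative assumptions, not values from the paper:

```python
import numpy as np

def assign_labels(h, w, box, sigma1=0.5, sigma2=1.0):
    """Label each feature-map location against one ground-truth box.

    box = (x1, y1, x2, y2) in feature-map coordinates.
    Returns an (h, w) map: 1 = positive (inside E_1),
    -1 = ignored (between E_1 and E_2), 0 = negative (outside E_2).
    sigma1 < sigma2 are assumed shrink factors for the two ellipses.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    ax, ay = (x2 - x1) / 2.0, (y2 - y1) / 2.0      # box semi-axes
    ys, xs = np.mgrid[0:h, 0:w]
    # normalized elliptical distance from the box center
    d = (xs - cx) ** 2 / (ax ** 2 + 1e-9) + (ys - cy) ** 2 / (ay ** 2 + 1e-9)
    labels = np.zeros((h, w), dtype=np.int8)        # negative by default
    labels[d <= sigma2 ** 2] = -1                   # between E_1 and E_2: ignored
    labels[d <= sigma1 ** 2] = 1                    # inside E_1: positive
    return labels
```

Shrinking the positive region with an inner ellipse keeps supervision near the object center, while the ignored band between the two ellipses avoids penalizing ambiguous border locations.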

Here, the overall training loss is defined as:

L({p_{x,y}}, {t_{x,y}}) = (1/N_pos) Σ_{x,y} L_cls(p_{x,y}, c*_{x,y}) + (λ/N_pos) Σ_{x,y} 1_{c*_{x,y} > 0} L_reg(t_{x,y}, t*_{x,y})

where L_cls denotes the focal loss, L_reg denotes the IOU loss, N_pos denotes the number of positive samples, the balance weight λ of L_reg is set to 1, and p_{x,y} and t_{x,y} are the classification and regression predictions at the point (x, y) on the feature map.
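As a sketch of the two loss terms under their standard definitions (the function names and per-location scalar formulation here are ours), the focal loss and the IOU loss for a single location can be written as:

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one location; target is 1 (positive) or 0."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # the (1 - p_t)^gamma factor down-weights easy, well-classified locations
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(max(p_t, 1e-9))

def iou_loss(pred, gt):
    """IOU loss -ln(IOU), with boxes given as (l, t, r, b) side distances."""
    pl, pt, pr, pb = pred
    gl, gt_, gr, gb = gt
    inter = (min(pl, gl) + min(pr, gr)) * (min(pt, gt_) + min(pb, gb))
    union = (pl + pr) * (pt + pb) + (gl + gr) * (gt_ + gb) - inter
    return -np.log(inter / union)
```

The IOU loss is zero only when the predicted and ground-truth offsets coincide, and grows as the overlap shrinks.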

A fully convolutional network is composed of a backbone network and two task-specific subnets. As shown in the architecture figure, each pyramid level P_i has 1/2^i of the resolution of the input image. FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Each layer of the pyramid can be used to detect objects of a different scale.

In the structure proposed in this paper, a 3 × 3 convolutional layer with K filters is appended to the feature map in the classification subnet, followed by a sigmoid function, so that the probability of each of the K object classes can be predicted at every spatial position. In addition, a 3 × 3 convolutional layer with four filters is appended to the feature map of the regression subnet, followed by a ReLU function, to predict the box offsets. After FPN's multi-level prediction, there will be many low-quality predicted bounding boxes far from the center of the object. We propose a simple and effective strategy to eliminate these low-quality detection bounding boxes without introducing any hyperparameters. Specifically, we add a separate branch parallel to the classification branch (as shown in
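A minimal NumPy stand-in for the two heads illustrates the output shapes and activations. For brevity it uses 1 × 1 projections instead of the real 3 × 3 convolutions, and all names are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def head_outputs(feat, w_cls, w_reg):
    """Toy detection heads on one feature map.

    feat: (C, H, W) feature map; w_cls: (K, C); w_reg: (4, C).
    Returns K per-class probability maps and 4 non-negative offset maps.
    """
    cls_logits = np.tensordot(w_cls, feat, axes=([1], [0]))   # (K, H, W)
    reg_raw = np.tensordot(w_reg, feat, axes=([1], [0]))      # (4, H, W)
    # sigmoid for per-class probabilities, ReLU for box offsets
    return sigmoid(cls_logits), np.maximum(reg_raw, 0.0)
```

The sigmoid makes each of the K class predictions an independent binary probability, while the ReLU guarantees the four predicted side distances are non-negative.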

The centrality of a location is calculated from its distances (l, t, r, b) to the four sides of its bounding box:

centerness = sqrt( (min(l, r) / max(l, r)) × (min(t, b) / max(t, b)) )

The square root used here slows down the decay of the centrality. The centrality ranges from 0 to 1, so it is trained with the binary cross-entropy (BCE) loss. At test time, the final score (used to rank the detected bounding boxes) is computed by multiplying the predicted centrality by the corresponding classification score. The centrality therefore down-weights bounding boxes far from the center of the object, which are finally removed by NMS.
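The FCOS-style centrality target described above can be computed directly from a location's four side distances:

```python
import math

def centerness(l, t, r, b):
    """Center-ness of a location from its distances to the four box sides.

    Equals 1 at the exact box center and decays toward 0 near the borders;
    the square root slows this decay.
    """
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
```

Multiplying this value into the classification score suppresses boxes predicted from off-center locations before NMS.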

The ground truth of the classification output is K feature maps, each corresponding to one class. An instance of class k affects the k-th ground-truth map in two ways. First, the region inside ellipse E_1 is the effective (positive) region, filled with the class label. Second, the area between E_1 and E_2 is an ignored region, which means that the gradient of this area does not propagate back to the network. It should be noted that if two objects overlap at the same level, the smaller object takes precedence. The remaining part is filled with black, indicating negative samples with no object. With hyperparameters α = 0.25 and γ = 2.0, the focal loss is used for supervision. For an image, the total classification loss of the anchorless branch is the sum of the focal loss over all non-ignored regions, normalized by the total number of pixels in all effective regions.

The ground truth of the regression output is 4 offset maps irrespective of category. The instance only affects the effective region: for each position inside it, the ground truth is the 4-dimensional vector of distances from that position to the top, left, bottom, and right sides of the instance box, divided by S, with each map corresponding to one dimension. S is a normalization constant; we choose S = 4.0 empirically. Positions outside the effective region are gradient-ignored areas. The regression output is optimized with the IOU loss. For an image, the total regression loss of the anchorless branch is the average of the IOU loss over all effective regions.
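A sketch of building these regression targets in NumPy, using a rectangular effective region for simplicity (the function name is ours; S = 4.0 as in the text):

```python
import numpy as np

def regression_targets(h, w, box, S=4.0):
    """Build the 4 offset maps (top, left, bottom, right)/S for one box.

    box = (x1, y1, x2, y2) in feature-map coordinates.
    Returns the (4, h, w) target maps and a mask of in-box (effective)
    positions; gradients are ignored outside the mask.
    """
    x1, y1, x2, y2 = box
    grid = np.mgrid[0:h, 0:w].astype(np.float64)
    ys, xs = grid[0], grid[1]
    # distances to the top, left, bottom, right sides, normalized by S
    targets = np.stack([ys - y1, xs - x1, y2 - ys, x2 - xs]) / S
    inside = targets.min(axis=0) >= 0.0
    return targets, inside
```

Normalizing by S keeps the regression targets in a numerically comfortable range regardless of box size.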

The anchorless design allows an instance to use the features of an arbitrary pyramid level P_i according to the content of the instance, rather than according to the size of the instance box as in anchor-based methods.

Given an instance I, define its classification loss and regression loss on pyramid level P_i as L_cls^i(I) and L_reg^i(I), respectively.

The instance is then assigned to the pyramid level whose anchorless branch learns it with the smallest combined loss, that is:

l* = argmin_i ( L_cls^i(I) + L_reg^i(I) )

For a training batch, features are updated with respect to their assigned instances. The intuition is that the selected feature is currently the optimal choice for modeling the instance; its loss forms a lower bound in the feature space, and training pulls this lower bound further down. At inference time, no feature selection is needed, because the most appropriate feature pyramid level naturally outputs high confidence scores.
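The feature-selection rule described above (pick the level with the smallest combined loss) reduces to an argmin per instance; a sketch with made-up loss values:

```python
import numpy as np

def select_level(cls_losses, reg_losses):
    """Assign the instance to the pyramid level with the smallest joint loss.

    cls_losses[i], reg_losses[i]: the instance's classification and
    regression losses evaluated on pyramid level P_i.
    """
    total = np.asarray(cls_losses, dtype=float) + np.asarray(reg_losses, dtype=float)
    return int(np.argmin(total))
```

For example, with classification losses [0.9, 0.4, 0.7] and regression losses [0.3, 0.2, 0.1] over three levels, the combined losses are [1.2, 0.6, 0.8], so the instance is assigned to level 1.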

Pascal VOC [

We evaluate our method with the center branch and the DSOF module; the evaluation metrics are the AP and mAP values, as shown in

Method | car | dog | train | bus | horse | boat | bird | person | mAP (%) |
---|---|---|---|---|---|---|---|---|---|
Baseline | 0.78 | 0.78 | 0.72 | 0.72 | 0.68 | 0.65 | 0.52 | 0.50 | 64.1 |
+center | 0.79 | 0.78 | 0.93 | 0.83 | 0.92 | 0.66 | 0.68 | 0.59 | 73.68 |
+center+DSOF | 0.80 | 0.92 | 0.96 | 0.86 | 0.96 | 0.72 | 0.80 | 0.76 | 84.24 |

See

| True 1 | True 0 |
---|---|---|
Predicted 1 | True Positive (TP) | False Positive (FP) |
Predicted 0 | False Negative (FN) | True Negative (TN) |

Precision = TP/(TP + FP), where FP is the number of false positives. The precision-recall (PR) curve reflects the tradeoff between the accuracy of positive predictions and the coverage of positive cases. For a random classifier, precision is fixed at the proportion of positive examples in the sample and does not change with recall. For multiple categories, a PR curve can be drawn for each category: by varying the confidence threshold from 10% to 100%, a set of (recall, precision) coordinates is obtained, and these points are connected to form the PR curve.
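The two quantities behind each point on the PR curve follow directly from the confusion-matrix counts above (the helper name is ours):

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts.

    Precision = TP / (TP + FP): how many predicted positives are correct.
    Recall    = TP / (TP + FN): how many actual positives are found.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Sweeping the confidence threshold changes the TP/FP/FN counts, and evaluating this pair at each threshold traces out the PR curve.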

We can learn from

In

Method | Train time (sec/img) | Test time (sec/img) | mAP (%) |
---|---|---|---|
Faster R-CNN | 1.2 | 0.42 | 73.8 |
R-FCN | 0.45 | 0.17 | 77.6 |
YOLOv3 | 0.34 | 0.40 | 79.25 |
 | 0.32 | 0.39 | 81.12 |
DSOF | 0.42 | 0.37 | 84.24 |

For target detection, we have proposed a simple and effective DSOF method. This module applies real-time online feature selection to train the anchorless branches in the feature pyramid, avoiding all computation and hyperparameters related to anchor boxes and predicting pixel by pixel. The method solves the target detection,

The authors would like to thank the researchers in the field of object detection and other related fields. This paper cites the research literature of several scholars; it would have been difficult to complete this paper without the inspiration of their research results. We are grateful for all the help we received in writing this article.