Real-time pedestrian detection is an important task for unmanned driving systems and video surveillance. Existing pedestrian detection methods often run slowly and lose accuracy on smaller and densely distributed pedestrians. The proposed YOLOv2 (“YOU ONLY LOOK ONCE Version 2”)-based pedestrian detection algorithm (YOLOv2PD) is better suited to detecting smaller and densely distributed pedestrians in complex real-time road scenes. YOLOv2PD adopts a Multi-layer Feature Fusion (MLFF) strategy, which improves the model’s feature extraction ability, and removes one repeated convolution layer from the final stage, reducing computational complexity without any loss of detection accuracy. Before training, the K-means clustering method is applied to the Pascal Voc-2007+2012 pedestrian dataset to find the optimal anchor boxes. Both the network structure and the loss function are improved to make the model more accurate and faster when detecting smaller pedestrians. Experimental results show that, at

One of the most important applications of computer vision (CV) in self-driving cars is pedestrian detection. The field of pedestrian detection covers video surveillance, criminal investigation, self-driving cars, and robotics. Real-time pedestrian detection is an important task for unmanned driving systems. The vision systems of autonomous vehicles were initially very difficult to develop; however, owing to continuous improvements in hardware computational power, many researchers have attempted to build reliable vision systems for self-driving cars. Since 2012, deep learning has achieved tremendous progress in CV. In the field of artificial intelligence, many deep learning-based algorithms have been introduced and used in a wide range of applications, such as signal, audio, image, and video processing. In particular, deep learning-based algorithms play a groundbreaking role in image and video processing tasks such as image classification and detection.

One of the direct applications of real-time pedestrian detection is automatically and accurately locating pedestrians with off-the-shelf cameras, a capability that plays a crucial role in robotics and unmanned driving systems. Despite the tremendous progress achieved recently, this task remains challenging due to the complexity of road scenes: crowding, occlusion, deformation, and lighting changes. Currently, unmanned driving systems are among the major fields of research in CV, and the real-time detection of pedestrians is essential to avoid possible accidents. Although deep learning-based techniques improve detection accuracy, there is still a huge gap between human and machine perception [

This is a major challenge for reliable vision-based detection systems, since self-driving cars operating in extremely complex real-time environments must be able to detect objects by day or at night. Nevertheless, current state-of-the-art (SOTA) real-time pedestrian detection still falls short of fast and accurate human perception [

Currently, pedestrian detection methods are classified into two generations: traditional methods and deep learning-based methods. Traditional methods cover various classical machine learning algorithms such as the Viola–Jones detector [

Generally, deep learning-based object detection methods are slow and cannot meet the real-time requirements of self-driving cars. Therefore, to improve both speed and detection accuracy, Redmon et al. [

To improve both detection accuracy and speed when detecting smaller and densely distributed pedestrians, a new pedestrian detection technique is proposed, YOLOv2-based pedestrian detection (in short, YOLOv2PD). An efficient K-means clustering [

The contributions of the proposed work can be summarized as follows:

The proposed YOLOv2PD model adopts the MLFF strategy to improve the model’s feature extraction ability and, at the higher end, one convolution layer is eliminated.

In addition, to test the effectiveness of the proposed model, another model, referred to as YOLOv2 Model A, is implemented and compared.

The loss function is improved by applying normalization, which reduces the effect of varying pedestrian sizes in an image and yields better-optimized detected bounding boxes.

Through qualitative and quantitative experiments conducted on Pascal Voc-2007+2012 Pedestrian, INRIA and Caltech pedestrian datasets, we validate the effectiveness of our algorithm, showing that it has better detection performance on smaller pedestrians.

The rest of the paper is organized as follows. Section 2 covers related work. Section 3 illustrates the proposed YOLOv2PD algorithm. Section 4 covers the benchmark datasets (Pascal Voc-2007+2012 Pedestrian, INRIA, and Caltech) and discusses the experimental results and analysis. Finally, the conclusion is presented and future work is discussed.

Pedestrian detection has been studied for several decades, with many different technologies employed, some of which have had significant impact. Some methods aim to improve the basic features utilized [

Benenson et al. [

Li et al. [

Song et al. [

Lin et al. [

Specifically, two-stage deep learning-based object detectors achieve higher localization accuracy and precision, but they require substantial resources and their computational efficiency is low. Owing to their unified network structures, one-stage detectors are much faster than two-stage detectors, although their precision decreases. Moreover, the amount of training data plays a vital role in deep learning-based object detectors. Inspired by YOLOv2, we present an end-to-end single deep neural network for detecting smaller and densely distributed pedestrians in real time. YOLOv2 (“You only look once version 2”) [

The proposed method YOLOv2PD adopts the YOLOv2 deep learning framework [

The proposed method applies the K-means clustering algorithm to the Pascal Voc-2007+2012 pedestrian dataset during training to select an optimal set of anchor boxes of different sizes. The traditional Euclidean distance is replaced with the YOLOv2 distance function d(box, centroid) = 1 − IoU(box, centroid), so that the clustering error is made independent of anchor box size by adopting IoU as the evaluation metric, as shown in

where box is a sample box; centroid is a cluster center; and IoU(box, centroid) is the overlap ratio between the sample box and the cluster center. Based on the analysis of the clustering results, K was chosen to be 6; therefore, six anchor boxes of different sizes are applied during pedestrian detection, which in turn improves the positioning accuracy.
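The anchor-clustering step described above can be sketched as follows. This is a minimal pure-Python version under two assumptions: boxes are given as (width, height) pairs, and IoU is computed as if both boxes shared a corner, as in the YOLOv2 anchor-clustering formulation; it is not the paper's exact implementation.

```python
import random

def iou_wh(box, centroid):
    # IoU of two (w, h) boxes assumed to share their top-left corner,
    # as in the YOLOv2 anchor-clustering formulation.
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union

def kmeans_anchors(boxes, k=6, iters=100, seed=0):
    """K-means over (w, h) boxes using d(box, centroid) = 1 - IoU."""
    random.seed(seed)
    centroids = random.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            # Minimizing 1 - IoU is the same as maximizing IoU.
            best = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[best].append(b)
        # New centroid = mean width/height of its cluster; keep the old
        # centroid if a cluster happens to be empty.
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids
```

With k = 6, as chosen in the paper, this yields six representative anchor sizes from the training boxes.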

Since images are captured by a video surveillance camera, pedestrians near the camera appear large in the image, while pedestrians far from the camera appear small. The same pedestrian may therefore appear at very different sizes across captured images.

During YOLOv2 training, objects of different sizes show different effects on the network and produce large errors, particularly for images with smaller and densely distributed objects. To overcome this drawback, loss calculation for bounding box (BB) width and height is improved by applying normalization.

where (x_i, y_i) are the center coordinates of the i-th predicted bounding box, w_i and h_i are its width and height, C_i is its confidence score, and p_i(c) is its conditional class probability; each loss term is the squared difference between a prediction and its ground-truth counterpart.
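The paper's exact normalized width/height loss is not reproduced here; the sketch below instead uses the square-root damping from the original YOLO loss, which pursues the same goal of preventing large boxes from dominating the error over small ones (function names and the [0, 1] scale are assumptions):

```python
import math

def wh_loss(pred, truth):
    """Width/height loss with square-root damping, as in the original
    YOLO formulation; pred and truth are lists of (w, h) pairs
    normalized to the [0, 1] image scale."""
    loss = 0.0
    for (pw, ph), (tw, th) in zip(pred, truth):
        loss += (math.sqrt(pw) - math.sqrt(tw)) ** 2
        loss += (math.sqrt(ph) - math.sqrt(th)) ** 2
    return loss
```

The point of the damping: the same absolute size error contributes more loss for a small box than for a large one, which is exactly the behavior needed for smaller pedestrians.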

Multi-layer Feature Fusion (MLFF) Approach: In pedestrian detection, pedestrians vary in occlusion, illumination, color, height, and contour, whereas local features exist only in the lower layers of a CNN. Therefore, to exploit local features fully, the MLFF approach is implemented in YOLOv2PD. The Reorg layer reshapes the feature maps of the fused layers to a common spatial size. Part (a) passes through the following
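The Reorg operation used for fusion can be sketched as a space-to-depth rearrangement: a higher-resolution feature map is reshaped so its spatial size matches a lower-resolution one, moving the extra resolution into channels. A minimal NumPy version (the actual Darknet implementation may differ in channel ordering):

```python
import numpy as np

def reorg(x, stride=2):
    """Space-to-depth: reshape an (H, W, C) feature map to
    (H/stride, W/stride, C*stride*stride) so a higher-resolution layer
    can be concatenated with a lower-resolution one."""
    h, w, c = x.shape
    assert h % stride == 0 and w % stride == 0
    x = x.reshape(h // stride, stride, w // stride, stride, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group each stride*stride patch
    return x.reshape(h // stride, w // stride, c * stride * stride)
```

For example, a 26×26×512 map becomes 13×13×2048 and can then be concatenated with a 13×13 map, which is how the route/Reorg pairs in the network tables below combine layers of different resolutions.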

YOLOv2 is a fast and accurate object detection model. The YOLOv2 network can detect over 9000 object categories spanning wide variation, such as cell phones, cars, fruits, sofas, and dogs. There are three repeated

A novel YOLOv2PD network structure is designed by adopting the MLFF approach, and one redundant convolutional layer is removed at the higher end. In addition, to test the effectiveness of the proposed model, another model, referred to as YOLOv2 Model A, was implemented and compared. YOLOv2 Model A removed two

| Layer No. | YOLOv2 | YOLOv2 Model A | YOLOv2PD |
|---|---|---|---|
| L0 | Conv 3×3, 416×416×32 | Conv 3×3, 416×416×32 | Conv 3×3, 416×416×32 |
| L1 | Maxpool/2 | Maxpool/2 | Maxpool/2 |
| L2 | Conv 3×3, 208×208×64 | Conv 3×3, 208×208×64 | Conv 3×3, 208×208×64 |
| L3 | Maxpool/2 | Maxpool/2 | Maxpool/2 |
| L4 | Conv 3×3, 104×104×128 | Conv 3×3, 104×104×128 | Conv 3×3, 104×104×128 |
| L5 | Conv 1×1, 104×104×64 | Conv 1×1, 104×104×64 | Conv 1×1, 104×104×64 |
| L6 | Conv 3×3, 104×104×128 | Conv 3×3, 104×104×128 | Conv 3×3, 104×104×128 |
| L7 | Maxpool/2 | Maxpool/2 | Maxpool/2 |
| L8 | Conv 3×3, 52×52×256 | Conv 3×3, 52×52×256 | Conv 3×3, 52×52×256 |
| L9 | Conv 1×1, 52×52×128 | Conv 1×1, 52×52×128 | Conv 1×1, 52×52×128 |
| L10 | Conv 3×3, 52×52×256 | Conv 3×3, 52×52×256 | Conv 3×3, 52×52×256 |
| L11 | Maxpool/2 | Maxpool/2 | Maxpool/2 |
| L12 | Conv 3×3, 26×26×512 | Conv 3×3, 26×26×512 | Conv 3×3, 26×26×512 |
| L13 | Conv 1×1, 26×26×256 | Conv 1×1, 26×26×256 | Conv 1×1, 26×26×256 |
| L14 | Conv 3×3, 26×26×512 | Conv 3×3, 26×26×512 | Conv 3×3, 26×26×512 |
| L15 | Conv 1×1, 26×26×256 | Conv 1×1, 26×26×256 | Conv 1×1, 26×26×256 |
| L16 | Conv 3×3, 26×26×512 | Conv 3×3, 26×26×512 | Conv 3×3, 26×26×512 |
| L17 | Maxpool/2 | Maxpool/2 | Maxpool/2 |
| L18 | Conv 3×3, 13×13×1024 | Conv 3×3, 13×13×1024 | Conv 3×3, 13×13×1024 |
| L19 | Conv 1×1, 13×13×512 | Conv 1×1, 13×13×512 | Conv 1×1, 13×13×512 |
| L20 | Conv 3×3, 13×13×1024 | Conv 3×3, 13×13×1024 | Conv 3×3, 13×13×1024 |
| L21 | Conv 1×1, 13×13×512 | Conv 1×1, 13×13×512 | Conv 1×1, 13×13×512 |
| L22 | Conv 3×3, 13×13×1024 | Conv 3×3, 13×13×1024 | Conv 3×3, 13×13×1024 |
| L23 | Conv 3×3, 13×13×1024 | Route-L16 | Conv 3×3, 13×13×1024 |
| L24 | Conv 3×3, 13×13×1024 | Conv 3×3, 13×13×512 | Route-L6 |
| L25 | Route-L16 | Conv 1×1, 13×13×64 | Conv 3×3, 13×13×128 |
| L26 | Conv 1×1, 13×13×64 | Reorg | Conv 1×1, 13×13×32 |
| L27 | Reorg | Route-L26 L22 | Reorg |
| L28 | Route-L27 L24 | Conv 3×3, 13×13×1024 | Route-L10 |
| L29 | Conv 3×3, 13×13×1024 | Conv 1×1, 13×13×30 | Conv 3×3, 13×13×256 |
| L30 | Conv 1×1, 13×13×30 | Detection | Conv 1×1, 13×13×64 |
| L31 | Detection | | Reorg |
| L32 | | | Route-L16 |
| L33 | | | Conv 3×3, 13×13×512 |
| L34 | | | Conv 1×1, 13×13×64 |
| L35 | | | Reorg |
| L36 | | | Route-L35 L31 L27 L23 |
| L37 | | | Conv 3×3, 13×13×1024 |
| L38 | | | Conv 1×1, 13×13×30 |
| | | | Detection |

Pascal Voc-2007+2012 dataset [

The INRIA Pedestrian dataset [

The Caltech pedestrian dataset [

| Datasets | Training Images | Testing Images |
|---|---|---|
| Pascal Voc-2007+2012 Pedestrian | 9072 | 1008 |
| INRIA | 614 | 228 |
| Caltech Pedestrian | 4250 | 4024 |

Both the training and testing phases were carried out on the same workstation. Darknet, pretrained on the large ImageNet dataset, was chosen as the feature extractor for all models. The workstation setup was: Windows 10 Pro, an Intel Xeon 64-bit CPU @ 3.60 GHz, 64 GB RAM, an Nvidia Quadro P4000 GPU, the CUDA 10.0 and cuDNN 7.4 GPU acceleration libraries, and the TensorFlow 1.x deep learning framework.

The model training was carried out on Pascal Voc-2007+2012 Pedestrian dataset (9072) training images and tested on 1008 testing images, since we are only concerned with pedestrian images. The input image size is resized to

Average precision (AP) and inference speed (FPS, frames per second) are the standard metrics used to evaluate model performance. Intersection over union (IoU) measures the accuracy of the model on a test dataset and is computed as the area of intersection divided by the area of union. IoU helps determine whether a predicted bounding box (BB) is a True Positive (TP), False Positive (FP), or False Negative (FN) by defining a threshold of
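The IoU computation and the resulting TP/FP decision can be sketched as follows (a minimal sketch; the corner-coordinate box format and the 0.5 default threshold are assumptions):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def classify(pred, gt, thresh=0.5):
    """A detection counts as TP when its IoU with the ground truth
    meets the threshold; otherwise it is an FP."""
    return "TP" if iou(pred, gt) >= thresh else "FP"
```

A ground-truth pedestrian matched by no detection above the threshold is then counted as an FN.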

Recall measures how well the model finds all of the positives; precision measures the accuracy of its predictions. In practice there is a trade-off between the two: raising one typically lowers the other.

AP: This is the area under the precision–recall curve, which shows the correlation between precision and recall at different confidence scores. A higher AP value indicates better detection accuracy.
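The AP computation can be sketched as below. This uses a simple all-point step integration of the precision-recall curve; the official Pascal VOC protocol additionally interpolates precision, so numbers can differ slightly (the function name and input format are illustrative):

```python
def average_precision(scored, num_gt):
    """Area under the precision-recall curve.
    `scored` is a list of (confidence, is_true_positive) detections;
    `num_gt` is the total number of ground-truth pedestrians."""
    scored = sorted(scored, key=lambda s: -s[0])  # highest confidence first
    tp = fp = 0
    points = []
    for _, is_tp in scored:
        tp += is_tp
        fp += 1 - is_tp
        points.append((tp / num_gt, tp / (tp + fp)))  # (recall, precision)
    # Step integration: each recall increment contributes its precision.
    ap, prev_r = 0.0, 0.0
    for r, p in points:
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

Sweeping the confidence threshold in this way is what produces the precision-recall correlation the text describes; a higher AP indicates better detection accuracy.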

The performance of the model while validating INRIA and Caltech test datasets was visualized using a plot between the number of false positives per image and the miss rate (MR). The ratio between the number of FNs and the total number of positive samples (N) is referred to as the MR.

The miss rate and recall are related by MR = 1 − Recall, since MR = FN/N and Recall = TP/N with TP + FN = N.
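The miss-rate definitions above can be checked with a small sketch (function names are illustrative):

```python
def miss_rate(fn, n_pos):
    """Miss rate: fraction of ground-truth pedestrians not detected."""
    return fn / n_pos

def recall(tp, n_pos):
    """Recall: fraction of ground-truth pedestrians detected."""
    return tp / n_pos

# With N positives split into TP detections and FN misses,
# MR = FN / N = 1 - TP / N = 1 - recall.
```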

With different input image resolutions of

To have a model that runs at higher inference speed, an image size of

| Input Size | Model | Average Precision (AP) | Inference Speed (FPS) |
|---|---|---|---|
| | YOLOv2 | 75.2 | 45.1 |
| | YOLOv2 Model A | 77.1 | 64 |
| | YOLOv2PD | 79.5 | 47.2 |
| | YOLOv2 | 76.5 | 32 |
| | YOLOv2 Model A | 78.3 | 38.2 |
| | YOLOv2 | 78.2 | 26.1 |
| | YOLOv2 Model A | 80.4 | 32.1 |
| | YOLOv2PD | 82.3 | 30.6 |

The Pascal Voc-2007+2012 pedestrian dataset contains 20 different classes and every class may have small objects. We were concerned with detecting smaller and densely distributed pedestrians in this dataset, so we manually picked up 330 images that mainly included smaller pedestrians to evaluate the model performance.

The evaluation results of all three models on the INRIA test dataset are expressed in terms of average precision and inference speed (milliseconds).

| Input size | Model | Average precision (AP) | Inference speed (ms) |
|---|---|---|---|
| 544 | YOLOv2 | 79.8 | 27.4 |
| 544 | YOLOv2 Model A | 84.6 | 24.7 |
| 608 | YOLOv2 | 82.5 | 36.3 |
| 608 | YOLOv2 Model A | 87.1 | 27.8 |
| 608 | YOLOv2PD | 93.4 | 26.5 |

To test the robustness of the proposed model, we compared our model performance on the INRIA pedestrian test dataset with several SOTA algorithms.

| Models/Avg. MR (%) | Reasonable | Runtime (FPS) |
|---|---|---|
| VJ [ | 72.5 | < 1 |
| HOG [ | 46 | < 1 |
| YOLOv2 [ | 12.5 | 32 |
| Very fast [ | 16 | |
| Spatial pooling [ | 11.2 | < 1 |
| RPN + BF [ | 6.9 | |
| YOLOv3 [ | 7.2 | 20 |
| Y-PD [ | 9.1 | 73 |
| F-DNN [ | | |
| YOLOv2PD (Ours) | 7.8 | 36.3 |

| Models/LAMR (%) | Reasonable | Average Precision (AP) | Runtime (s) |
|---|---|---|---|
| RPN + BF [ | 9.580 | 0.324 | 0.50 |
| SA-FastRCNN [ | 9.680 | 0.344 | 0.59 |
| UDN + SS [ | 11.520 | 0.331 | 0.28 |
| M-GAN [ | – | – | |
| Faster RCNN + ATT-Vbb [ | 10.330 | – | – |
| TTL(MRF) + LSTM [ | 7.400 | – | – |
| SSNet [ | 8.920 | 0.360 | 0.43 |
| Y-PD [ | 18.4 | 0.321 | – |
| SDS-RCNN [ | 7.360 | 0.355 | |
| CompactACT + Deep [ | 11.750 | 0.334 | 1.00 |
| YOLOv2PD (Ours) | 7.480 | | 0.29 |

From

To show the findings more intuitively, and to verify that the proposed algorithm achieves a good balance between detection speed and accuracy in real time, we fed a real-time test video to all models. The detection results of the randomly selected 79

A new model named YOLOv2PD was proposed for the accurate detection of smaller and densely distributed pedestrians. The proposed YOLOv2PD network structure improves the network’s feature extraction ability by adopting the MLFF strategy, and one repeated convolutional layer is removed at the higher end. To improve detection accuracy on smaller and more densely distributed pedestrians, the loss function was improved by applying normalization. The experimental results show that, for an applied input image of

AP: Average Precision

CV: Computer Vision

CUDA: Compute Unified Device Architecture

DPM: Deformable Part Model

FPS: Frames per second

FP: False Positive

FN: False Negative

HOG: Histogram of Oriented Gradient

IoU: Intersection over Union

MR: Miss-rate

MLFF: Multi-layer Feature Fusion

Pascal VOC: Pascal Visual Object Classes

R-CNN: Regions Based Convolutional Neural Networks

SPP-Net: Spatial Pyramid Pooling Network

SSD: Single Shot Multi-Box Detector

SOTA: State-of-the-art

TP: True Positive

TN: True Negative

YOLO: YOU ONLY LOOK ONCE

YOLOv2: YOU ONLY LOOK ONCE Version 2

YOLOv2PD: YOU ONLY LOOK ONCE Version 2 Based Pedestrian Detection