Estimating the crowd count and density of the highly dense scenes witnessed at Muslim gatherings at the religious sites in Makkah and Madinah is critical for developing control strategies and organizing such large gatherings. Moreover, since crowd images in this setting range from low to high density, detection-based approaches are hard to apply to crowd counting. Recently, deep learning-based regression has become the prominent approach to crowd counting, where a density-map is estimated and its integral is then computed to obtain the final count. In this paper, we put forward a novel multi-scale network (named 2U-Net) for crowd counting in both sparse and dense scenarios. The proposed framework, which employs the U-Net architecture, is straightforward to implement, computationally efficient, and trained in a single step. Unpooling layers are used to retrieve the information erased by the pooling layers and to learn a hierarchical pixel-wise spatial representation. This helps in obtaining feature values, retaining spatial locations, and maximizing data integrity to avoid information loss. In addition, a modified attention unit is introduced and integrated into the proposed 2U-Net model to focus on specific crowd areas. Unlike other works, which may pursue one criterion at the expense of the others, the proposed model balances the number of model parameters, model size, computational cost, and counting accuracy. Experiments on five challenging density estimation and crowd counting datasets show that the proposed model is very effective, outperforms comparable mainstream models, and counts well in both sparse and congested crowd scenes. The 2U-Net model achieves the lowest MAE on the ShanghaiTech Part A and Part B, UCSD, and Mall benchmarks, with 63.3, 7.4, 1.5, and 1.6, respectively.
Furthermore, it obtains the lowest MSE on the ShanghaiTech-Part B, UCSD, and Mall benchmarks, with 12.0, 1.9, and 2.1, respectively.

Automatic crowd analysis is essential for effective crowd management for every entity responsible for ensuring public safety. Two of the most significant and recent tasks in crowd analysis are density estimation (DE) and crowd counting (CC) [

DE in computer vision is intended to estimate the spatial distribution of a crowd image, and CC seeks to compute the number of people in images or videos automatically. Accurate CC is required in many situations and occasions, such as public demonstrations, sports activities, and religious gatherings [

There are four main constraints on running current CNN crowd counting models: the model size, the number of model parameters, the run-time memory requirement, and the counting accuracy. Some methods have been proposed to overcome or relax some of these limitations, but at the expense of the others. For example, complex models with a large number of parameters are almost certain to be slow and costly, which is inconvenient for applications that require quick reactions. In short, existing models are still far from the desired balance of accuracy and efficiency in real-world scenarios. This research seeks to achieve high counting accuracy using a simple CNN architecture with fewer parameters and a smaller model size, built from two U-Net streams. The main novelties and contributions of this study are summarized as follows:

To cope with the challenge of crowd counting in realistic circumstances, a multi-scale framework named 2U-Net is proposed. Using two efficient parallel encoder-decoder architectures, the proposed model gains rich contextual information and constructs a high-quality density map with high CC accuracy. It has fewer parameters and yields competitive results. To our knowledge, no prior studies have focused on estimating high-quality density maps while preserving a small number of parameters. We also use several quality metrics to assess the density maps created by the proposed framework, including the peak signal-to-noise ratio and the structural similarity index.

To allow the 2U-Net model to focus on crowd areas, a modified attention unit has been introduced and integrated into the 2U-Net architecture.

To tackle the issue of data loss actuated by the pooling layers of the U-Net, unpooling layers are utilized to upsample the downsampled maps.

To test the performance of the proposed 2U-Net, five challenging benchmarks for image and video crowd counting are utilized. The paper is intended for crowd counting in the holy places of Makkah and Madinah as a special case study for congested crowd scenes; thus, we used the Haramain benchmark [

The remainder of this article is organized as follows:

Several approaches have been presented in the literature to address the challenges of DE and CC, which can be classified into two major groups: traditional approaches and deep learning approaches. Further details on the DE and CC approaches are given in the next sub-sections.

In early studies on crowd counting, researchers used detection-based approaches, which utilized a sliding window to detect every individual and then estimate the number of observed instances. The detection-based approaches utilize handcrafted features derived from a single pedestrian to train a classifier [

Motivated by the outstanding performance of deep learning in the computer vision field [

Several multi-column or multi-branch architectures have been adopted to address scale variation and cluttered backgrounds for better counting accuracy. These column architectures use different receptive fields to accommodate different crowd densities, which is challenging given how widely crowd density varies across scenes [

Unlike other methods, the proposed 2U-Net aims to construct high-quality density estimation maps by using two parallel U-Nets that maintain spatial information, followed by one convolutional layer that fuses the generated density and attention maps. Consequently, pixel-wise regression counting accuracy in the predicted map is improved. In a larger sense, estimating the density map is thus closely related to other localization challenges like tracking [

The aim of this research is to address the image/frame crowd counting problem. Previous studies have found that density-based crowd counting approaches accomplish higher performance than directly regressing the number of individuals [

In this work, a non-linear regression function is learned by minimizing the MSE loss (1st U-Net) and the BCE loss (2nd U-Net). Further details can be found in

A novel multi-scale two-stream U-Net (2U-Net) is proposed to deal with the challenge of crowd counting, especially in the holy places of Makkah and Madinah, and produce high-quality density-maps. The overall workflow of the proposed crowd counting framework using the proposed 2U-Net model is shown in

BN [

For extracting spatial features from a crowded frame, two convolutional layers are employed first, as shown in the figure. Each feature map f^{(k,i)} in the k-th layer is obtained by convolving every input map f^{(k−1,t)} with a learnable kernel w^{(k,i,t)}; the resulting maps are summed, a bias is added, and the total is passed through an activation function to produce f^{(k,i)} as follows:

where
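The per-layer computation above can be sketched in plain NumPy (a minimal sketch; the names `conv_layer`, `prev_maps`, `kernels`, and the use of ReLU and "valid" correlation are our illustrative assumptions — the actual model uses PyTorch convolutional layers):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_layer(prev_maps, kernels, biases):
    """Compute f^(k,i) = ReLU(sum_t f^(k-1,t) * w^(k,i,t) + b^(k,i)).

    prev_maps: (T, H, W) input feature maps f^(k-1,t)
    kernels:   (I, T, kh, kw) learnable kernels w^(k,i,t)
    biases:    (I,) bias terms
    Returns (I, H-kh+1, W-kw+1) output maps ('valid' correlation, no padding).
    """
    T, H, W = prev_maps.shape
    I, _, kh, kw = kernels.shape
    out = np.zeros((I, H - kh + 1, W - kw + 1))
    for i in range(I):
        for t in range(T):          # sum the responses over all input maps
            for y in range(H - kh + 1):
                for x in range(W - kw + 1):
                    out[i, y, x] += np.sum(prev_maps[t, y:y + kh, x:x + kw]
                                           * kernels[i, t])
        out[i] += biases[i]         # add the per-output-map bias
    return relu(out)                # non-linear activation
```

A 3 × 3 all-ones kernel over an all-ones input produces 9.0 at every valid position, which is an easy way to sanity-check the summation.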

Backbone layers | | | Decoder layers | |
---|---|---|---|---|---
Layer name | Output | Configuration | Layer name | Output | Configuration
Conv2d-1 | 512 × 512 | 3 × 3, 64 | Maxunpool2d-1 | 64 × 64 | 2, stride 2
Conv2d-2 | 512 × 512 | 3 × 3, 64 | Conv2d-1 | 64 × 64 | 1 × 1, 256
MaxPool2d-1 | 256 × 256 | 2, stride 2 | Conv2d-2 | 64 × 64 | 3 × 3, 256
Conv2d-3 | 256 × 256 | 3 × 3, 128 | Maxunpool2d-2 | 128 × 128 | 2, stride 2
Conv2d-4 | 256 × 256 | 3 × 3, 128 | Conv2d-3 | 128 × 128 | 1 × 1, 128
MaxPool2d-2 | 128 × 128 | 2, stride 2 | Conv2d-4 | 128 × 128 | 3 × 3, 128
Conv2d-5 | 128 × 128 | 3 × 3, 256 | Maxunpool2d-3 | 256 × 256 | 2, stride 2
Conv2d-6 | 128 × 128 | 3 × 3, 256 | Conv2d-5 | 256 × 256 | 1 × 1, 64
Conv2d-7 | 128 × 128 | 3 × 3, 256 | Conv2d-6 | 256 × 256 | 3 × 3, 64
MaxPool2d-3 | 64 × 64 | 2, stride 2 | Conv2d-7 | 256 × 256 | 3 × 3, 32
Conv2d-8 | 64 × 64 | 3 × 3, 512 | | |
Conv2d-9 | 64 × 64 | 3 × 3, 512 | | |
Conv2d-10 | 64 × 64 | 3 × 3, 512 | | |
MaxPool2d-4 | 32 × 32 | 2, stride 2 | | |
Conv2d-11 | 32 × 32 | 3 × 3, 512 | | |
Conv2d-12 | 32 × 32 | 3 × 3, 512 | | |
Conv2d-13 | 32 × 32 | 3 × 3, 512 | | |

Notes: * The parameters of a convolutional layer “Conv2d” are given as “kernel size, number of filters, stride, dilation”; the default settings for stride, dilation, and padding are 1, 1, and 0, respectively. A max-pooling layer “MaxPool2d” is described as “kernel size, stride”.
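As a quick consistency check on the Output column of the table above, the backbone's spatial sizes can be traced programmatically (a sketch; the 3 × 3 convolutions are assumed to preserve spatial size, as the table implies, and each 2 × 2 max-pooling halves it):

```python
# Backbone from the table: ("conv", out_channels) keeps the spatial size,
# ("pool", stride) divides it by the pooling stride.
backbone = [
    ("conv", 64), ("conv", 64), ("pool", 2),
    ("conv", 128), ("conv", 128), ("pool", 2),
    ("conv", 256), ("conv", 256), ("conv", 256), ("pool", 2),
    ("conv", 512), ("conv", 512), ("conv", 512), ("pool", 2),
    ("conv", 512), ("conv", 512), ("conv", 512),
]

def trace_sizes(size, layers):
    """Return the spatial size after each pooling layer."""
    after_pool = []
    for kind, arg in layers:
        if kind == "pool":
            size //= arg
            after_pool.append(size)
    return after_pool
```

Starting from a 512 × 512 input, `trace_sizes(512, backbone)` yields `[256, 128, 64, 32]`, matching the table row by row.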

Pooling layers result in downsampling the feature maps. From
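The pooling/unpooling pairing can be illustrated with a minimal NumPy sketch that records the arg-max indices during pooling and writes the pooled values back to those exact positions on the way up (in the PyTorch implementation this corresponds to `MaxPool2d(..., return_indices=True)` paired with `MaxUnpool2d`; the function names here are ours):

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max-pooling that also returns the flat index of each maximum."""
    H, W = x.shape
    out = np.zeros((H // k, W // k))
    idx = np.zeros((H // k, W // k), dtype=int)
    for i in range(H // k):
        for j in range(W // k):
            win = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            flat = int(np.argmax(win))
            out[i, j] = win.flat[flat]
            # Flat index of the maximum in the ORIGINAL (H, W) array.
            idx[i, j] = (i * k + flat // k) * W + (j * k + flat % k)
    return out, idx

def max_unpool(pooled, idx, shape):
    """Place pooled values back at their recorded positions; zeros elsewhere."""
    up = np.zeros(shape)
    up.flat[idx.ravel()] = pooled.ravel()
    return up
```

Unlike plain upsampling, the unpooled map restores each maximum to its original spatial location, which is what lets the decoder retain pixel-wise positional information.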

The feature-map grid is progressively down-sampled in conventional CNN architectures to obtain a sufficiently large receptive field. Thus, semantic contextual features are obtained. However, decreasing false-positive predictions for tiny objects with considerable shape variability is still challenging. As a result, several computer vision frameworks depend on extra prior object-localization models that break the process into separate localization and subsequent processing steps. Oktay et al. [

As illustrated in
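A generic additive attention gate, in the spirit of Oktay et al., can be sketched as follows. This is an illustrative sketch only: the weight shapes (`Wx`, `Wg`, `psi`, acting as 1 × 1 convolutions) and the exact wiring are our assumptions, not the paper's modified unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, Wx, Wg, psi):
    """x: (Cx, H, W) encoder features; g: (Cg, H, W) gating signal.
    Wx: (Ci, Cx), Wg: (Ci, Cg), psi: (Ci,) act as 1x1 convolutions.
    Returns x scaled by a per-pixel attention map in (0, 1)."""
    inter = (np.einsum("ic,chw->ihw", Wx, x)
             + np.einsum("ic,chw->ihw", Wg, g))
    inter = np.maximum(inter, 0.0)                        # ReLU
    alpha = sigmoid(np.einsum("c,chw->hw", psi, inter))   # attention coefficients
    return x * alpha                                      # broadcast over channels
```

Because `alpha` lies in (0, 1) at every pixel, background responses are suppressed while crowd regions pass through nearly unchanged.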

The U-Net is arguably the most successful architecture in many areas relevant to computer vision, such as crowd counting, segmentation, and concrete crack detection. The U-Net architecture is symmetrical, with a contracting pathway “encoder” on the left (the encoder configuration details are described in

The whole 2U-Net model is trained using the MSE and BCE losses, and the Adam optimizer is used for optimization. Both MSE and BCE losses are utilized to train the 1^{st} U-Net and the 2^{nd} U-Net, respectively. They are defined as follows:

where
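The two loss terms can be written compactly (a NumPy sketch of the standard MSE and BCE definitions; in training they are applied to the predicted density-map and attention-map, respectively, and the function names are ours):

```python
import numpy as np

def mse_loss(pred_density, gt_density):
    """Pixel-wise mean squared error for the 1st U-Net (density stream)."""
    return np.mean((pred_density - gt_density) ** 2)

def bce_loss(pred_attention, gt_attention, eps=1e-7):
    """Binary cross-entropy for the 2nd U-Net (attention stream).
    Predictions are clipped away from 0/1 for numerical stability."""
    p = np.clip(pred_attention, eps, 1.0 - eps)
    return -np.mean(gt_attention * np.log(p)
                    + (1.0 - gt_attention) * np.log(1.0 - p))
```

In practice the two losses are backpropagated through their respective streams jointly, with Adam as the optimizer.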

This section presents the evaluation metrics and experimental details. The findings of the proposed 2U-Net are then recorded and evaluated on five common standard crowd counting benchmarks.

Two kinds of metrics are employed to assess the overall performance of the proposed 2U-Net model and to test the quality of the estimated density-map: model evaluation metrics and density-map evaluation metrics. Details of both types are given in the following sub-sections.

Model evaluation is performed by calculating the mean absolute error (MAE) and the mean squared error (MSE) on different public datasets. The MAE and MSE [

where for an
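These two metrics can be computed as follows (a sketch; note that, by the convention common in crowd-counting papers, the reported "MSE" is the square root of the mean squared count error — we assume the same convention here):

```python
import numpy as np

def counting_mae(gt_counts, pred_counts):
    """Mean absolute error over per-image crowd counts."""
    gt, pred = np.asarray(gt_counts, float), np.asarray(pred_counts, float)
    return np.mean(np.abs(gt - pred))

def counting_mse(gt_counts, pred_counts):
    """Root of the mean squared count error (crowd-counting 'MSE')."""
    gt, pred = np.asarray(gt_counts, float), np.asarray(pred_counts, float)
    return np.sqrt(np.mean((gt - pred) ** 2))
```

For example, ground-truth counts [10, 20, 30] against predictions [12, 16, 30] give an MAE of 2.0 and an MSE of √(20/3) ≈ 2.58.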

High-resolution density-maps generally provide high location accuracy as well as maintain more spatial information for localization challenges (e.g., detection and tracking). The quality of the density-map can be examined using two standard metrics: peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [

where
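Both quality metrics can be sketched in a few lines (PSNR follows the standard definition; the SSIM shown here is a simplified single-window variant — library implementations such as scikit-image average the statistic over local Gaussian windows instead):

```python
import numpy as np

def psnr(img, ref, max_val=255.0):
    """Peak signal-to-noise ratio between an estimated and a reference map."""
    mse = np.mean((img.astype(float) - ref.astype(float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(x, y, max_val=255.0):
    """Single-window (global) SSIM; a simplification of the windowed metric."""
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

Higher is better for both: identical maps give SSIM = 1 and an infinite PSNR.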

There are several publicly available benchmarks for DE and CC frameworks that can be used to assess performance tests and evaluations.

Benchmark | Year | Type | Place | No. of scenes | Color | Resolution | No. of images/frames
---|---|---|---|---|---|---|---
ShanghaiTech-Part A | 2016 | Image | Outdoor | 482 | RGB | Varied | 482
ShanghaiTech-Part B | 2016 | Image | Outdoor | 716 | RGB | 768 × 1024 | 716
UCF | 2013 | Image | Outdoor | 50 | RGB/Grey | Varied | 50
UCSD | 2008 | Video | Outdoor | 1 | Grey | 158 × 238 | 2,000
Mall | 2012 | Video | Indoor | 1 | RGB | 640 × 480 | 2,000
Haramain-H1 | 2021 | Video | Indoor | 1 | RGB | 576 × 720 | 70
Haramain-H2 | 2021 | Video | Outdoor | 1 | RGB | 576 × 720 | 60
Haramain-H3 | 2021 | Video | Outdoor | 1 | RGB | 1280 × 720 | 60

One of the most popular datasets for crowd-counting applications is the ShanghaiTech dataset [

Although the UCF dataset [

The UCSD benchmark [

The Mall benchmark [

The Haramain dataset [

The training and evaluation were conducted using PyTorch on a Tesla V100 GPU. For a fair comparison, we use the measurement approach described in [

We initially generate the GT density-maps and, afterward, produce GT attention-maps. Following [

where

An attention-map
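The ground-truth generation step can be sketched as follows (a fixed-σ Gaussian is assumed here, and the kernel size and attention threshold `t` are illustrative values, not the paper's exact settings):

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2-D Gaussian kernel (sums to 1, so each head adds 1 to the count)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def density_map(shape, head_points, sigma=4.0, ksize=25):
    """Sum a normalized Gaussian centred on each annotated head position."""
    dm = np.zeros(shape)
    k, r = gaussian_kernel(ksize, sigma), ksize // 2
    for y, x in head_points:
        y0, y1 = max(y - r, 0), min(y + r + 1, shape[0])
        x0, x1 = max(x - r, 0), min(x + r + 1, shape[1])
        # Crop the kernel at image borders so indices stay valid.
        dm[y0:y1, x0:x1] += k[r - (y - y0):r + (y1 - y), r - (x - x0):r + (x1 - x)]
    return dm

def attention_map(dm, threshold=1e-3):
    """Binary GT attention-map: 1 on crowd regions, 0 on background."""
    return (dm > threshold).astype(np.float32)
```

Because each kernel integrates to one, the integral of the density-map recovers the ground-truth count, and thresholding it yields the binary attention target for the second stream.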

In this section, we compare the results of our model to those of other models on five distinct benchmarks to highlight the efficiency of our model.

Measures like the number of model parameters and the runtime are utilized to evaluate model computational complexity. A model with fewer parameters will run more efficiently, but often at the cost of accuracy, as with the Cascaded-MTL model [

Methods | MAE↓ | MSE↓ | PSNR↑ | SSIM↑ | Parameters | Runtime (ms) | Device
---|---|---|---|---|---|---|---
Zhang et al. [ | 181.8 | 277.7 | – | – | 0.62M | – | –
Cascaded-MTL [ | 126.5 | 173.5 | – | – | | | TITAN-X
SaCNN [ | 86.8 | 139.2 | – | – | 24.1M | – | –
Switching-CNN [ | 90.4 | 135.0 | 21.91 | 0.67 | 15.1M | 153 | –
ACSCP [ | 75.7 | 102.7 | – | – | 5.10M | – | –
CP-CNN [ | 73.6 | 106.4 | 21.72 | 0.72 | 62.9M | 5113 | –
PCC Net [ | 73.5 | 124.0 | 22.78 | 0.74 | 0.55M | 89 | 1080Ti
CSRNet [ | 68.2 | 115.0 | – | – | 16.3M | – | –
U-ASD Net [ | 64.6 | 106.1 | 41.41 | | 31.4M | 94 | Tesla V100
2U-Net [ours] | 63.3 | 103.8 | | | 17.7M | 82 | Tesla V100

Method | Part A MAE↓ | Part A MSE↓ | Part B MAE↓ | Part B MSE↓
---|---|---|---|---
Zhang et al. [ | 181.8 | 277.7 | 32.0 | 49.8
FCN [ | 126.5 | 173.5 | 23.8 | 33.1
MCNN [ | 110.2 | 173.2 | 26.4 | 41.3
Cascaded-MTL [ | 101.3 | 152.4 | 20.0 | 31.1
Switching-CNN [ | 90.4 | 135.0 | 21.6 | 33.4
CP-CNN [ | 73.6 | 106.4 | 20.1 | 30.1
SaCNN [ | 86.8 | 139.2 | 16.2 | 25.8
DAN [ | 81.8 | 134.7 | 13.2 | 20.1
ACSCP [ | 75.7 | 102.7 | 17.2 | 27.4
CSRNet [ | 68.2 | 115.0 | 10.6 | 16.0
PCC Net [ | 73.5 | 124.0 | 11.0 | 19.0
TEDnet [ | 64.2 | 109.1 | 8.2 | 12.8
AAFM [ | 67.1 | 104.2 | 10.6 | 15.8
DENet [ | 65.5 | | 9.6 | 15.4
FMLF [ | 69.8 | 114.7 | 10.2 | 14.9
DSPNet [ | 68.2 | 107.8 | 8.9 | 14.0
N^{2}CC [ | 85.3 | 137.4 | 18.8 | 29.2
ResNet-DC-PCM [ | 73.5 | 118.1 | 13.3 | 22.5
AWRFN [ | 66.7 | 109.1 | 11.5 | 19.5
Zhang et al. [ | – | – | 8.3 | 12.9
SUA-Fully [ | 66.9 | 125.6 | 12.3 | 17.9
U-ASD Net [ | 64.6 | 106.1 | 7.5 | 12.4
2U-Net [ours] | 63.3 | 103.8 | 7.4 | 12.0

Network | MAE | MSE | PSNR | SSIM
---|---|---|---|---
U-Net [ | 16.4 | 25.0 | 47.98 | 0.99
2U-Net | | | |

Method | MAE↓ | MSE↓
---|---|---
Zhang et al. [ | 467.0 | 498.5
MCNN [ | 377.6 | 509.1
FCN [ | 338.6 | 424.5
Cascaded-MTL [ | 322.8 | 397.9
Switching-CNN [ | 318.1 | 439.2
CP-CNN [ | 295.8 | 320.9
SaCNN [ | 314.9 | 424.8
DAN [ | 309.6 | 402.6
ACSCP [ | 291.0 | 404.6
CSRNet [ | 266.1 | 397.5
TEDnet [ | 249.4 | 354.5
HA-CNN [ | 256.2 | 348.4
AAFM [ | 247.1 | 329.4
DENet [ | 241.9 | 345.4
MCNN-VGG [ | 244.3 | 359.7
N^{2}CC [ | 380.5 | 513.0
AWRFN [ | 257.3 | 337.2
ResNet-DC-PCM [ | 254.8 |
2U-Net [ours] | 239.4 | 356.1

Method | MAE↓ | MSE↓ | Runtime (ms) | Model size (MB)
---|---|---|---|---
U-ASD Net [ | | | 62 | 126
2U-Net | 239.4 | 356.1 | |

Method | MAE↓ | MSE↓
---|---|---
Gaussian process regression [ | 2.2 | 8.0
Cumulative attribute regression [ | 2.1 | 6.9
Ridge regression [ | 2.3 | 7.8
Count forest [ | 1.6 | 4.4
Zhang et al. [ | 1.6 | 3.3
ConvLSTM-nt [ | 1.7 | 3.5
Switching-CNN [ | 1.6 | 2.1
U-ASD Net [ | 1.7 | 2.1
2U-Net [ours] | 1.5 | 1.9

Method | MAE↓ | MSE↓
---|---|---
Gaussian process regression [ | 3.7 | 20.1
Cumulative attribute regression [ | 3.4 | 17.7
Ridge regression [ | 3.6 | 19.0
Detector [ | 20.6 | 439.1
R-FCN [ | 6.0 | 5.5
Count forest [ | 2.5 | 10.0
Faster R-CNN [ | 5.9 | 6.6
Bi-ConvLSTM [ | 2.1 | 7.6
ACM-CNN [ | 2.3 | 3.1
ST-CNN [ | 4.0 | 5.9
MCNN+SEG+LR [ | 2.2 | 2.8
TAN [ | 2.0 | 2.6
FMLF [ | 1.9 | 2.3
ResNet-DC-PCM [ | 2.5 | 3.1
U-ASD Net [ | 1.8 | 2.2
2U-Net [ours] | 1.6 | 2.1

Dataset | Method | MAE↓ | MSE↓ | Runtime (ms) | Model size (MB)
---|---|---|---|---|---
Haramain-H1 | U-ASD Net | 2.3 | | 87 | 126
Haramain-H1 | 2U-Net | 1.6 | | |
Haramain-H2 | U-ASD Net | 7.8 | 8.6 | 86 | 126
Haramain-H2 | 2U-Net | | | 77 |
Haramain-H3 | U-ASD Net | | | 94 | 126
Haramain-H3 | 2U-Net | 9.6 | 13.1 | |

To evaluate the density-map quality generated by the proposed 2U-Net, both PSNR and SSIM metrics were recorded and compared with state-of-the-art methods: Zhang et al. [

Method | Part A PSNR↑ | Part A SSIM↑ | Part B PSNR↑ | Part B SSIM↑
---|---|---|---|---
Zhang et al. [ | – | – | 28.09 | 0.89
Switching-CNN [ | 21.91 | 0.67 | – | –
PCC Net [ | 22.78 | 0.74 | – | –
CP-CNN [ | 21.72 | 0.72 | – | –
CSRNet [ | 23.79 | 0.76 | 27.02 | 0.89
TEDnet [ | 25.88 | 0.83 | – | –
DENet [ | 24.54 | 0.78 | 25.74 | 0.80

In this work, we proposed a new end-to-end crowd model, called 2U-Net, that can accurately estimate high-quality crowd density-maps and count the crowd in images and frames. By using a two-stream U-Net, high counting accuracy has been achieved. The proposed 2U-Net utilizes the unpooling operation to solve the problem of information loss induced by the pooling operations of the U-Net. Besides, a modified attention unit is introduced and integrated into the proposed 2U-Net model to concentrate on crowd regions. The results indicate that the model is effective in estimating high-quality density-maps as well as in counting crowds. Furthermore, the 2U-Net model provides comparable results to the U-ASD Net model with fewer parameters, a lower running time, and a smaller model size. Compared with other state-of-the-art frameworks, our framework achieves a reasonable trade-off between model performance and the number of network parameters.

Currently, our model has certain limitations in some crowd images since it does not account for various characteristics that exist in real-world locations, such as different lighting conditions. We will examine varied illumination settings in future work to lessen the impact of varying illumination on our model. In addition, we plan to apply the proposed model to more real-world use scenarios, especially in the Holy Places of Makkah and Madinah.

The authors extend their appreciation to the Deputyship of Research & Innovation, Ministry of Education in Saudi Arabia, for funding this research work through Project Number 758. The authors also would like to thank the Research Management Center of Universiti Teknologi Malaysia for managing this fund under vot. no. 4C396.

N^{2}CC: A novel method for crowd counting via two-task convolutional neural network