Appearance-based dynamic Hand Gesture Recognition (HGR) remains a prominent area of research in Human-Computer Interaction (HCI). Numerous environmental and computational constraints limit its real-time deployment. In addition, the performance of a model decreases as the subject’s distance from the camera increases. This study proposes a 3D separable Convolutional Neural Network (CNN) that balances the model’s computational complexity and recognition accuracy. The 20BN-Jester dataset was used to train the model for six gesture classes. After achieving the best offline recognition accuracy of 94.39%, the model was deployed in real-time while considering the subject’s attention, the instant of performing a gesture, and the subject’s distance from the camera. Despite being discussed in numerous research articles, the distance factor remains unresolved in real-time deployment, which leads to degraded recognition results. In the proposed approach, the distance calculation substantially improves the classification performance by reducing the impact of the subject’s distance from the camera. Additionally, the feature extraction capability, degree of relevance, and statistical significance of the proposed model against other state-of-the-art models were validated using t-distributed Stochastic Neighbor Embedding (t-SNE), the Matthews Correlation Coefficient (MCC), and the McNemar test, respectively. We observed that the proposed model exhibits state-of-the-art outcomes and a comparatively high significance level.

The vision-based HGR is a prominent area of research in non-verbal HCI [

Recent technological advancements such as high-resolution cameras, Graphics Processing Units (GPUs), and 3D cameras have resulted in the development of numerous HGR models. The HGR models can be grouped into either depth-based or appearance-based approaches, depending on the technology utilized for the data acquisition. In the depth-based approach, a camera with an embedded distance sensor is used to acquire the raw data. The object’s depth details are provided through the distance sensor, while the camera is employed to concentrate on the hands. Kinect-1.0 [

The appearance-based approach can be further classified into static and dynamic HGR. In static HGR, data is recorded in the form of still pictures, which requires the subject to hold a specific stance for an instant so that it can be accurately captured. In dynamic HGR, a sequence of gesture movements is provided as the model’s input. This approach is more practical because a gesture is naturally performed by making certain movements. For this reason, dynamic HGR has great significance, as it incorporates the behavior of Human-to-Human Interaction (HHI).

The appearance-based HGR faces several obstacles due to diverse environmental variables that affect the overall performance of the model. These problems include varying gesture velocity, illumination, skin color, and the subject’s distance from the camera. In addition, the computational complexity of a model is a major challenge. A model with high computational complexity requires expensive computing resources, which not only increases the model’s cost but also limits its implementation on a target device.

Researchers have proposed numerous HGR techniques, including the Hidden Markov Model (HMM), Template Matching [

This paper is organized as follows. Section 2 provides an overview of the related research work in HGR, while Sections 3 and 4 discuss the dataset and the pre-processing approaches used in this research, respectively. Sections 5 and 6 provide details about the base model and the proposed model, respectively, followed by the real-time implementation of the HGR models in Section 7. The experimental setup for training, validation, and testing is discussed in Section 8. The behavioral analysis of the model is given in Section 9, followed by the experimental results in Section 10. Finally, a detailed discussion of the experimental results and the conclusion are provided in Sections 11 and 12, respectively.

The advent of HCI in the early 1980s paved the way for vision-based HGR. The initial vision-based HGR attempts were performed either with colored gloves or with hand markers. These techniques, however, were unable to attain the required accuracy due to various technological restrictions, including camera resolution and computational power. Besides color-glove and hand marker-based approaches [

In 1993, vision-based HGR began to draw wider attention from researchers. Since then, numerous models have been developed, including SVM, HMM, DTW, and Template Matching. In addition to these machine learning models, the CNN has proven to be the most efficient feature extraction and classification model. Molchanov et al. [

Kopuklu et al. [

A 3D separable CNN model was proposed by Hu et al. [

The majority of the models proposed so far were developed to acquire the input data while concentrating only on the subject’s hands. In contrast, the proposed research work incorporates the upper half of the subject’s body, which includes the head, face, hands, and arms. In order to extract the key features from the acquired data, several head and face detection algorithms have been proposed by researchers. Saqib et al. [

HGR, regardless of the number of remarkable research contributions, still encounters certain computational and performance barriers. We observed that the deep learning architectures developed so far either have high performance but are computationally complex, or vice versa. Hence, a trade-off exists between a model’s complexity and its performance. The objective of the proposed research was to develop a 3D CNN architecture with comparatively better recognition accuracy and less computational complexity. Besides this, the other factor that highly impacts the model’s performance is the subject’s distance from the camera. The proposed work therefore developed an approach for the real-time deployment of the model that takes the subject’s distance from the camera into account.

The proposed model was designed for training on a dataset that accounts for the most realistic aspects, including varying illumination, gesture velocity, and skin color. Another important factor was that a gesture sample should include the subject’s entire body above the waist rather than just the hands. This consideration was necessary to enable the proposed model for natural HHI. The 20BN-Jester dataset [

In the proposed study, we aimed to operate the basic features of a desktop using gestures, i.e., sliding left, sliding right, sliding up, sliding down, and terminating the active window on a computer. Therefore, 6 out of 27 classes from the 20BN-Jester dataset were utilized, i.e., swiping left, swiping right, swiping up, swiping down, stop sign, and no gesture class. The desktop operations were performed in correspondence to these gesture classes, except for the no-gesture class.
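As an illustration of how the six gesture classes could drive desktop operations, a minimal dispatch table is sketched below; the action names are hypothetical and do not reflect the authors' actual implementation:

```python
# Hypothetical mapping of the six 20BN-Jester classes used in this study
# to desktop actions; the action identifiers are illustrative only.
GESTURE_ACTIONS = {
    "Swiping Left": "slide_left",
    "Swiping Right": "slide_right",
    "Swiping Up": "slide_up",
    "Swiping Down": "slide_down",
    "Stop Sign": "close_active_window",
    "No gesture": None,  # no desktop operation is triggered
}

def dispatch(predicted_class):
    """Return the desktop action for a predicted gesture class (None = no-op)."""
    return GESTURE_ACTIONS.get(predicted_class)
```

In a deployed system, the returned action string would be translated into the corresponding window-manager or keyboard command.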

Deep learning architectures, in contrast to classical machine learning, can be trained on data samples without any pre-processing. This reduces human effort to some extent, but it requires a comparatively longer learning time due to the high dimensionality of the raw input. This problem can be addressed through pre-processing algorithms such as motion history images [

The procedure for the frame differencing algorithm depends on the following two steps:

Firstly, the input video frames were converted from RGB to grayscale, as shown in

Secondly, the subject’s movement from the first to the last frame is extracted by subtracting each frame from the subsequent frame, as shown in

Since each video sample has a distinct number of input frames, we standardized the number of frames to 8 after the frame differencing. Shorter samples were padded by appending the difference of the final frame with itself (i.e., a zero frame), while longer samples had their trailing frames removed. Removing the extra frames, generally 1 or 2, caused no loss since the key gesture features were observed to end earlier, as shown in
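The two pre-processing steps above, frame differencing followed by frame-count standardization, can be sketched as a minimal NumPy version. The grayscale weights (ITU-R BT.601) and the zero-frame padding are assumptions based on the description, not the authors' published code:

```python
import numpy as np

def frame_differencing(frames):
    """Convert RGB frames to grayscale, then subtract consecutive frames.

    frames: (T, H, W, 3) uint8 RGB video. Returns (T-1, H, W) uint8
    absolute difference frames that capture the subject's movement.
    """
    gray = (0.299 * frames[..., 0] + 0.587 * frames[..., 1]
            + 0.114 * frames[..., 2]).astype(np.int16)
    return np.abs(np.diff(gray, axis=0)).astype(np.uint8)

def standardize_length(diff_frames, target=8):
    """Pad with zero frames (a frame differenced with itself) or trim
    trailing frames so that every sample has exactly `target` frames."""
    t = diff_frames.shape[0]
    if t < target:
        pad = np.zeros((target - t,) + diff_frames.shape[1:],
                       dtype=diff_frames.dtype)
        return np.concatenate([diff_frames, pad], axis=0)
    return diff_frames[:target]
```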

The dynamic HGR requires the extraction of spatial as well as temporal features, which results in high-dimensional complex data. Hu et al. [

A convolution kernel size of 3 × 3 × 3 was used to reduce the number of weights,

The temporal dimension was down-sampled at the end (i.e., convolution blocks 8 and 9) to learn more features,

To learn the downsampling process, the greater value of stride was used instead of pooling, and

To extract better information, more channels were added before the downsampling.

The convolution blocks 2 to 10 performed 3D separable convolution, while convolution block 1 utilized the standard 3D convolution [
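As a rough illustration of why separable 3D convolution with 3 × 3 × 3 kernels reduces the number of weights, the parameter counts of a standard and a depthwise-separable 3D convolution can be compared; the channel sizes used below are illustrative, not the model's actual configuration:

```python
def conv3d_params(c_in, c_out, k=3):
    """Weights in a standard 3D convolution (bias terms ignored)."""
    return c_in * c_out * k ** 3

def separable_conv3d_params(c_in, c_out, k=3):
    """Depthwise 3D convolution (one k^3 kernel per input channel)
    followed by a 1x1x1 pointwise convolution."""
    return c_in * k ** 3 + c_in * c_out
```

For example, with 64 input and 128 output channels the standard convolution needs 221,184 weights against 9,920 for the separable variant, a reduction of more than 20×.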

The base model was trained on 6 hand gestures from a dataset collected through HoloLens. The dataset contains a total of 110,000 samples, with an average of 22,000 samples per class. The frame differencing algorithm was used to discard the complex background before training the model. The negative log function was used as the loss function, while RMSprop was used for the model’s optimization. In addition, the layer-wise learning approach [

The proposed model was structured with enhanced generalization and feature extraction capabilities. These fundamental characteristics have significantly improved the proposed model’s performance in comparison to other state-of-the-art models. The steps taken to accomplish the desired outcome are discussed in the following subsections.

The base model proposed in [

In

The stride was used for the downsampling in the base model to automatically learn the process, as shown in

The parameters of the base model were reduced by a factor of up to eight before the fully connected layers, as shown in

The proposed model architectures without and with parameters enhancement are shown in

The models comprise 10 convolution blocks and a classification block,

Convolution Block 1 performs the standard 3D convolution,

Convolution Blocks 2 to 10 utilize the 3D separable convolution [

The ResNet [

The only difference between the proposed models without and with parameter enhancement is at Convolution Blocks 8 and 10, as can be observed from the output dimensions of the respective blocks in

The real-time deployment of the proposed HGR model was based on the following three factors:

Attention of the subject towards the device,

The instant of performing a gesture, and

The subject’s distance from the camera.

These factors are critical in real-time deployment because ignoring them would result in significant wastage of computing resources. In addition, ignoring the distance between the subject and the camera results in a decline in the model’s performance. The following subsections discuss the real-time deployment of the model.

The subject’s attention can be detected from the face by determining whether it is directed towards or away from the camera. This was accomplished with the help of a face detection algorithm. Researchers have proposed various face detection algorithms, such as MTCNN [

In order to ensure the precise detection of the subject’s attention, we utilized four sequential frames in real-time. The subject is considered attentive only when the algorithm detects the face in each of these four frames. The proposed approach is explained with the help of a pseudocode, as given in Algorithm 1.
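The four-frame attention check of Algorithm 1 can be sketched as follows; `detect_face` stands in for any face detector returning a truthy result when a face is found (e.g. an MTCNN wrapper) and is an assumption, not the paper's exact interface:

```python
def subject_is_attentive(frames, detect_face, window=4):
    """Return True only when a face is detected in each of the last
    `window` consecutive frames, mirroring the four-frame check.

    detect_face: callable taking a frame and returning a truthy value
    (e.g. a bounding box) when a face is found.
    """
    recent = frames[-window:]
    return len(recent) == window and all(detect_face(f) for f in recent)
```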

The consideration of the subject’s attention preserves some computational resources; however, a subject may be attentive but still not be performing any gesture. For this purpose, a frame differencing approach was adopted, which was based on four consecutive frames. The algorithm detects any movement below the neck based on a threshold. This approach efficiently utilizes the resources to a greater extent and also helps to compute the gesture’s duration. The pseudocode for the respective approach is given in Algorithm 2.
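A minimal sketch of the movement check in Algorithm 2, assuming grayscale input frames; the `neck_row` boundary and `threshold` values are illustrative, since the paper does not reproduce its exact parameters:

```python
import numpy as np

def motion_below_neck(frames, neck_row, threshold):
    """Detect movement in the region below the neck across the last four
    consecutive grayscale frames, as in the frame differencing approach.

    frames: sequence of 2D uint8 arrays; neck_row: row index separating
    the head region from the body (assumption); threshold: mean absolute
    difference above which motion is reported (assumption).
    """
    body = [f[neck_row:, :].astype(np.int16) for f in frames[-4:]]
    diffs = [np.abs(b - a) for a, b in zip(body, body[1:])]
    return max(d.mean() for d in diffs) > threshold
```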

The model’s performance degrades as the subject’s distance from the camera increases [

In the first step, the facial coordinates

Secondly, the threshold value of

In the case of the

An offset coefficient

The subject’s body width

The coefficient

Finally, the desired frame

The desired outcome of

The proposed approach was validated for a maximum distance of 183.1 cm between the subject and the camera. The model’s performance degraded drastically for a distance greater than 183.1 cm. The overall workflow of the real-time deployment of the proposed HGR model is shown in
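The distance compensation can be sketched as a face-anchored crop that keeps the subject at a roughly constant scale in the input frame. The `body_to_face_ratio` coefficient below is an illustrative stand-in for the paper's offset coefficient, which is not reproduced in this excerpt:

```python
def distance_crop(frame_w, frame_h, face_box, body_to_face_ratio=4.0):
    """Compute a crop window centred on the subject, scaled from the
    face bounding box so distant subjects fill a similar frame fraction.

    face_box: (x, y, w, h) of the detected face. Returns (x0, y0, x1, y1)
    clipped to the frame bounds. body_to_face_ratio is an assumption.
    """
    x, y, w, h = face_box
    cx = x + w / 2                        # horizontal centre of the face
    body_w = w * body_to_face_ratio       # estimated body width
    x0 = max(0, int(cx - body_w / 2))
    x1 = min(frame_w, int(cx + body_w / 2))
    y0 = max(0, int(y - h))               # keep some headroom above the face
    y1 = frame_h                          # keep everything down to the waist
    return x0, y0, x1, y1
```

The cropped region would then be resized to the model's input dimensions before inference.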

The proposed model’s performance was evaluated and compared with other state-of-the-art models, i.e., 3D Separable CNN [

The proposed and comparative models were trained on 25,340 samples, validated on 2512 samples, and tested on 2000 samples. While training the models, the loss for each batch was calculated using the negative log function as defined in (6).
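The per-batch loss in (6) can be sketched as the mean negative log-likelihood of the true class over the batch; the `eps` guard against log(0) is an implementation detail, not part of the paper's definition:

```python
import math

def negative_log_loss(probs, labels):
    """Mean negative log-likelihood of the true class over a batch.

    probs: list of per-sample softmax outputs; labels: true class indices.
    """
    eps = 1e-12  # numerical guard against log(0)
    return -sum(math.log(p[y] + eps) for p, y in zip(probs, labels)) / len(labels)
```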

The loss optimization for each model was performed using the Adam optimizer [

The real-time evaluation of the HGR models was performed on a total of 360 samples at distances of 75.3, 123.7, and 183.1 cm from the camera. The gestures to be performed were prompted at random on a computer screen, with the same frequency at each distance. The real-time samples were acquired using a 2D computer camera, following the approach discussed in Section 7. The state-of-the-art and the proposed models were loaded in parallel so that each model obtained the same input. For comparison, the models were also evaluated without the distance calculation at distances of 123.7 and 183.1 cm from the camera. Since 75.3 cm is the usual working distance, no distance calculation was performed at this distance. The distance was measured with a laser distance meter LDM-60, as shown in

The models were evaluated based on the recognition accuracy, computational cost, computation time, and the impact of the distance between the subject and the camera on recognition accuracy. The recognition accuracy of the models was evaluated using the average accuracy, precision, recall, and F1 score. In addition, the confusion matrices were used to obtain the individual class’s recognition accuracy. The computational costs of various models were compared based on floating-point operands (FLOPs), number of parameters, computation time using GPU, and the model’s size.

Furthermore, the relevance of the model was tested using the MCC. The MCC ranges from

The first element “a” denotes the number of samples correctly predicted by both models under consideration,

The second element “b” denotes the number of samples correctly predicted by the first model but incorrectly predicted by the second,

The third element “c” denotes the number of samples correctly predicted by the second model but incorrectly predicted by the first, and

The last element “d” denotes the number of samples incorrectly predicted by both models under consideration.

The model’s significance can be computed using the formula given

The significance of the proposed model was calculated with a 95% confidence interval (
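Given the contingency elements b and c defined above, the McNemar statistic can be computed as follows; the continuity-corrected form is an assumption, since the excerpt does not reproduce the paper's exact formula:

```python
def mcnemar_statistic(b, c):
    """McNemar chi-squared statistic with the continuity correction:
    (|b - c| - 1)^2 / (b + c), where b and c count the samples on which
    exactly one of the two compared models is correct. The continuity-
    corrected variant is an assumption about the paper's formula.
    """
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)
```

At a 95% confidence level, the statistic is compared against the chi-squared critical value of 3.841 (1 degree of freedom): values above it indicate a significant difference between the two models.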

The enhancement of the proposed model in terms of generalization, feature extraction, and the model’s parameters as discussed in Section 6 was analyzed in comparison to the base model [

The response of the proposed model with enhanced generalization in comparison to the base model is shown in

The enhancement of the model’s generalization has resulted in a low convergence rate, as can be observed from the gradient plots in

The proposed model without and with parameter enhancement was trained and tested on the dataset as mentioned in Section 8.1. The performance of the proposed model was compared with that of other state-of-the-art models, i.e., 3D separable CNN, 3D CNN, and C3D. These models provide better recognition accuracy for the HGR. However, the majority of existing models ignore the computational cost and real-time deployment of models on devices with limited resources. In this research, the proposed model and other state-of-the-art models were evaluated in both offline and real-time scenarios.

The offline testing of the proposed model and different state-of-the-art models was performed using 2000 samples of the 20BN-Jester dataset. The performance of the models in terms of different metrics is given in

| Models | Accuracy | Precision | Recall | F1 | MCC | McNemar |
|---|---|---|---|---|---|---|
| 3D separable CNN [ | 0.6115 | 0.6935 | 0.6115 | 0.5635 | 0.5493 | 541.03 |
| 3D CNN [ | 0.9085 | 0.9086 | 0.9085 | 0.9084 | 0.8902 | 5.39 |
| C3D [ | 0.9130 | 0.9130 | 0.9130 | 0.9129 | 0.8956 | 3.58 |
| Proposed model without parameters enhancement | 0.9320 | 0.9344 | 0.9320 | 0.9325 | 0.9187 | 1.43 |
| Proposed model with parameters enhancement | 0.9245 | 0.9258 | 0.9245 | 0.9248 | 0.9095 | – |

In addition, the average accuracy attained by different models using the 20-BN Jester dataset is shown in

| Models | Average accuracy |
|---|---|
| ResNet101 [ | 97.0% |
| 3D CNN [ | 90.0% |
| PAN ResNet101 [ | 97.4% |
| X3D MobileNet-V3 [ | 95.56% |
| 3D-SqueezeNet [ | 90.77% |
| 3D-ShuffleNet V2 [ | 86.91% |
| 3D MobileNet V2 [ | 86.43% |
| Proposed model without parameters enhancement | 93.20% |
| Proposed model with parameters enhancement | 92.45% |

In the next step, the models were tested in real-time with and without incorporating the distance calculation, and the results were analyzed.

The real-time performance of the proposed models and other state-of-the-art models for three distinct positions of the subject from the camera are given in

The proposed model with parameter enhancement at a usual distance of 75.3 cm achieved a comparatively better recognition accuracy of 91.66%, which was followed by the proposed model without parameter enhancement and 3D CNN with recognition accuracies of 90.83% and 90.00%, respectively.

At distances greater than the usual 75.3 cm, the performance of all the HGR models drastically degrades without the distance calculation. At a distance of 123.7 cm, the best classification performance of 67.50% was obtained for the proposed model with parameter enhancement, which was followed by the proposed model without parameter enhancement and the 3D CNN with recognition accuracies of 63.33% and 55.83%, respectively. Increasing the distance to 183.1 cm further degraded the performance of all models, and the proposed model without parameter enhancement provided the best accuracy of 55.00%. The 3D Separable CNN performed poorly at both distances.

The inclusion of the distance calculation has substantially improved the overall performance of all the HGR models. At a distance of 123.7 cm, the best accuracy of 95.00% was obtained for the proposed model with parameter enhancement, which was followed by the proposed model without parameter enhancement and the 3D CNN with classification accuracies of 88.33% and 82.50%, respectively. Increasing the distance to 183.1 cm degraded the performance of all models, and the best accuracy of 89.16% was obtained for the proposed model with parameters enhancement.

| Models | Distance between subject and camera (cm) | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| **Real-time model testing at a normal distance** | | | | | |
| 3D separable CNN [ | 75.3 | 0.5916 | 0.5046 | 0.5916 | 0.5337 |
| 3D CNN [ | 75.3 | 0.9000 | 0.9119 | 0.9000 | 0.8985 |
| C3D [ | 75.3 | 0.8833 | 0.9024 | 0.8833 | 0.8852 |
| Proposed model without parameters enhancement | 75.3 | 0.9083 | 0.9169 | 0.9083 | 0.9074 |
| Proposed model with parameters enhancement | 75.3 | 0.9166 | 0.9276 | 0.9166 | 0.9164 |
| **Real-time model testing without inclusion of distance calculation** | | | | | |
| 3D separable CNN [ | 123.7 | 0.4500 | 0.3436 | 0.4500 | 0.3751 |
| 3D CNN [ | 123.7 | 0.5583 | 0.5986 | 0.5583 | 0.5300 |
| C3D [ | 123.7 | 0.5583 | 0.4695 | 0.5583 | 0.4724 |
| Proposed model without parameters enhancement | 123.7 | 0.6333 | 0.7573 | 0.6333 | 0.6039 |
| Proposed model with parameters enhancement | 123.7 | 0.6750 | 0.8796 | 0.6750 | 0.6835 |
| 3D separable CNN [ | 183.1 | 0.3833 | 0.4988 | 0.3833 | 0.3197 |
| 3D CNN [ | 183.1 | 0.3250 | 0.2086 | 0.3250 | 0.2342 |
| C3D [ | 183.1 | 0.4833 | 0.4615 | 0.4833 | 0.3720 |
| Proposed model without parameters enhancement | 183.1 | 0.5500 | 0.6029 | 0.5500 | 0.4672 |
| Proposed model with parameters enhancement | 183.1 | 0.5000 | 0.4988 | 0.5000 | 0.3908 |
| **Real-time model testing with inclusion of distance calculation** | | | | | |
| 3D separable CNN [ | 123.7 | 0.5166 | 0.4440 | 0.5166 | 0.4726 |
| 3D CNN [ | 123.7 | 0.8250 | 0.8774 | 0.8250 | 0.8119 |
| C3D [ | 123.7 | 0.7000 | 0.8744 | 0.7000 | 0.6974 |
| Proposed model without parameters enhancement | 123.7 | 0.8833 | 0.9093 | 0.8833 | 0.8844 |
| Proposed model with parameters enhancement | 123.7 | 0.9500 | 0.9522 | 0.9500 | 0.9499 |
| 3D separable CNN [ | 183.1 | 0.3833 | 0.2695 | 0.3833 | 0.3023 |
| 3D CNN [ | 183.1 | 0.6250 | 0.6000 | 0.6250 | 0.5834 |
| C3D [ | 183.1 | 0.5916 | 0.6626 | 0.5916 | 0.549 |
| Proposed model without parameters enhancement | 183.1 | 0.7583 | 0.8638 | 0.7583 | 0.7536 |
| Proposed model with parameters enhancement | 183.1 | 0.8916 | 0.8992 | 0.8916 | 0.8930 |

In general, the proposed model with parameter enhancement provided the best performance in all scenarios, followed by the proposed model without parameter enhancement and the 3D CNN. It was also observed that the 3D Separable CNN performed poorly due to its inefficient feature extraction capability, leading to a high rate of overfitting. The confusion matrices for the HGR models at each distinct position mentioned in

The computational complexity of a model significantly depends on its architectural design and input dimensions. The proposed approach is based on a 3D separable CNN [

| Models | MFLOPs | MParameters | Computation time (msec.) | Model size (MB) |
|---|---|---|---|---|
| 3D separable CNN [ | 499 | 1.1 | 47.7 | 4.19 |
| 3D CNN [ | 9479 | 9.03 | 82.5 | 34.4 |
| C3D [ | 32496 | 52.85 | 209.6 | 201 |
| Proposed model without parameters enhancement | 616 | 1.1 | 68.0 | 4.19 |
| Proposed model with parameters enhancement | 591 | 1.36 | 62.5 | 5.19 |

The proposed study developed two models for the HGR. The proposed HGR model without parameter enhancement is computationally more complex than the one with parameter enhancement. The McNemar test showed no significant difference between the two models. In the offline scenario, the proposed model outperformed other state-of-the-art models in terms of accuracy, precision, recall, and F1 score. Besides the offline scenario, the models were tested in real-time scenarios as well. The real-time tests were conducted at three distinct positions, both without and with the proposed distance calculation. The test results showed that the proposed model achieved better performance than the other state-of-the-art models. In addition, we observed that the proposed distance calculation significantly improved the model’s performance, thereby enabling reliable real-time deployment at distances of up to 183.1 cm from the camera. When the distance exceeded 183.1 cm, the model’s performance degraded as the subject’s features began to blend into the frame’s background. The 3D Separable CNN performed poorly in all scenarios due to its inefficient feature extraction capability.

The architectural design of the proposed model considers the model’s computational complexity in addition to its performance. Apart from the 3D Separable CNN, the proposed model was observed to have comparatively lower computational complexity than the other models. Computational complexity strongly impacts real-time performance: computationally complex models are unable to produce the desired outcome in time, whereas lightweight models can.

The proposed study developed a deep learning architecture for HGR with comparatively enhanced generalization and feature extraction capabilities while utilizing fewer computational resources. The comparative results in the offline scenario showed that the proposed model outperformed other state-of-the-art models in terms of accuracy, precision, recall, and F1 score. In addition, the evaluation results showed higher significance for the proposed model in comparison to other state-of-the-art models. Furthermore, a novel approach was proposed for the real-time implementation of the HGR model while considering the subject’s attention, the instant of performing a gesture, and the subject’s distance from the camera. The real-time performance of the proposed model and other state-of-the-art models was evaluated at distances of 75.3, 123.7, and 183.1 cm, with and without consideration of the distance factor. The evaluation results showed that ignoring the distance factor resulted in a significant drop in the models’ performance, while the inclusion of the proposed distance calculation substantially improved the performance of all the HGR models. Based on the experimental results, we conclude that the proposed model achieved state-of-the-art performance in both the offline and real-time scenarios.

The proposed approach is defined for a single subject within the camera’s vision range with a static background. The prediction of the proposed model will be randomized if there are two or more subjects in view of the camera or if the computational device is mobile because the model will not be able to extract the key gesture features. In the future, we aim to develop an approach that can be used for multiple subjects with a dynamic background.

The authors would like to thank Prince Sultan University for its support and for paying the Article Processing Charges (APC) of this publication.

The authors declare that they have no conflicts of interest to report regarding the present study.