Integrating attention mechanism and multi-scale feature extraction for fall detection


Heliyon. 2024 May 30; 10(10): e31614.

Published online 2024 May 21. doi:10.1016/j.heliyon.2024.e31614

PMCID: PMC11145491

PMID: 38831825

Hao Chen, Wenye Gu, Qiong Zhang, Xiujing Li, and Xiaojing Jiang



Abstract

Addressing the critical need for accurate fall event detection due to their potentially severe impacts, this paper introduces the Spatial Channel and Pooling Enhanced You Only Look Once version 5 small (SCPE-YOLOv5s) model. Fall events pose a challenge for detection due to their varying scales and subtle pose features. To address this problem, SCPE-YOLOv5s introduces spatial attention to the Efficient Channel Attention (ECA) network, which significantly enhances the model's ability to extract features from spatial pose distribution. Moreover, the model integrates average pooling layers into the Spatial Pyramid Pooling (SPP) network to support the multi-scale extraction of fall poses. Meanwhile, by incorporating the ECA network into SPP, the model effectively combines global and local features to further enhance the feature extraction. This paper validates the SCPE-YOLOv5s on a public dataset, demonstrating that it achieves a mean Average Precision of 88.29%, outperforming the You Only Look Once version 5 small by 4.87%. Additionally, the model achieves 57.4 frames per second. Therefore, SCPE-YOLOv5s provides a novel solution for fall event detection.

Keywords: Fall events, Spatial attention, Efficient channel attention, Spatial pyramid pooling

1. Introduction

The prevention of fall events is a crucial task in daily life [1]. The risk of falling can be significantly increased by mobility limitations or environmental factors. Fall events can result in physical injuries, such as fractures and abrasions, as well as a negative impact on a person's quality of life. Therefore, real-time detection of fall events is very significant for injury prevention and personal safety [2].

Traditional detection methods mainly rely on wearable sensors to detect fall events by analyzing the user's motion data [3,4]. This approach is effective but has limitations, such as the inconvenience of wearing a device and concerns about personal privacy [5]. Video surveillance techniques have gained considerable attention as technology has advanced. This non-invasive approach recognizes fall events from images or video data acquired by cameras and offers high accuracy and real-time performance. Therefore, vision-based fall detection methods have become a new focus of fall detection research [6].

With the development of deep learning, Convolutional Neural Network (CNN)-based methods are widely used for target detection [7]. Among them, the You Only Look Once (YOLO) model is widely used in various scenarios [[8], [9], [10]], due to its efficient target detection abilities. However, the YOLO model faces two major challenges in fall detection. Firstly, the fall pose has significant scale variation, which complicates the model's ability to extract multi-scale features. To address this problem, researchers have attempted to enhance the feature extraction module of the YOLO model, aiming to enable it to capture these multi-scale features more efficiently and thus identify fall events more accurately. Secondly, fall postures often involve subtle pose features, such as minor movements of body parts and slight changes in balance. These subtle features are crucial for accurately distinguishing normal activities from fall events, but traditional feature extraction methods often struggle to capture these subtle changes. Therefore, further optimization of the YOLO model is necessary to better focus on the spatial and positional features of falls, to identify fall events more accurately.

The Efficient Channel Attention (ECA) [11] network focuses primarily on channel attention and largely neglects spatial attention. As a result, when pose and shape features have varied spatial distributions, models may struggle to extract this critical information, making it difficult to distinguish between normal activities and fall events. The Spatial Pyramid Pooling (SPP) [12] network can capture features at different scales, which is crucial for recognizing fall events at various scales. However, traditional SPP networks employ a max-pooling strategy, which causes the model to focus on local features in the image while ignoring broader background information. In fall detection, background information is crucial for distinguishing fall events from daily activities, so traditional SPP networks may lead to misclassification by ignoring it. Moreover, fall events usually involve multiple local features and contextual information, yet traditional SPP networks do not fully consider spatial relationships and inter-channel dependencies in images, which limits the performance of the model in complex scenes.

In summary, this paper builds on the You Only Look Once version 5 small (YOLOv5s) and proposes the Spatial Channel and Pooling Enhanced YOLOv5s (SCPE-YOLOv5s) model. The choice of YOLOv5s is justified by existing research that has demonstrated its effectiveness in object detection tasks [13]. The most important contributions of this paper are as follows:

  • (1)

    An improved ECA network is proposed and added to the enhanced feature extraction layer of YOLOv5s. By enhancing the spatial attention mechanism, this network enables the model to understand the spatial distribution of pose features more accurately in the fall detection scene, thereby effectively differentiating between normal activities and fall events.

  • (2)

    An improved SPP network is proposed. The average pooling layers are introduced into the SPP network, which helps the model to extract multi-scale features of fall events and enhances the model's ability to capture background information in the fall environment. Additionally, an improved ECA network is incorporated into the SPP network to further enhance the global and local feature extraction abilities.

  • (3)

    This paper conducted experiments on a public dataset to validate the effectiveness of the proposed model. The experimental results show that, when compared to other state-of-the-art methods, the proposed model is optimal for detecting fall events.

The remainder of the paper is organized as follows: Section 2 describes related work. Section 3 describes the proposed model in detail. Section 4 conducts the experiments and analyzes the results. A discussion is provided in Section 5. A conclusion is provided in Section 6.

2. Related work

Table 1 illustrates a summary of related work. Traditional methods of fall detection rely heavily on wearable sensors. For example, Kerdjidj et al. [14] used a wearable Shimmer device that transmits inertial signals to a computer through a wireless connection, enabling fall detection in elderly patients. Chander et al. [15] proposed a method using a soft robotic stretch sensor to monitor human movement and perform fall detection. This method provides higher flexibility and comfort. Additionally, Alarifi et al. [16] collected information through wearable devices, performed feature analysis and dimensionality reduction, and effectively extracted key features related to falls. The advantage of this method is its ability to improve detection accuracy through machine learning algorithms. Nooruddin et al. [17] proposed an Internet of Things (IoT) system that can be deployed on any type of device. The generality of this method allows it to be adapted to different scenarios and user requirements. Additionally, Al Nahian et al. [18] proposed a novel fall detection scheme based on wearable accelerometer data, where the main features are extracted through feature reduction techniques. Although obtaining fall information through sensors is a very direct and practical method, its limitations cannot be ignored. Firstly, wearable devices can be challenging for older people to wear. Secondly, since the sensors need to continuously collect the user's movement data, these data may include sensitive personal information, posing significant privacy concerns. Therefore, the application of the above methods in the field of fall detection still faces many challenges.

Table 1

Related work summary.

| Category | Reference | Description | Key Feature |
| Wearable Devices | [14] | A wearable Shimmer device. | It offers real-time, precise monitoring but is uncomfortable to wear and may raise privacy concerns. |
| | [15] | Soft robotic stretch sensors. | |
| | [16] | A tri-axial device with a magnetometer, gyroscope, and accelerometer. | |
| | [17] | An IoT system. | |
| | [18] | Wearable accelerometer sensors. | |
| YOLO | [19,20] | YOLOv5 and lightweight networks. | It offers non-contact monitoring for various posture scales, but extracting subtle spatial and positional features is challenging. |
| | [21–23] | Enhanced YOLO with multi-scale features. | |
| | [24–28] | Enhanced YOLO with attention modules. | |


With the rapid development of computer vision technology, video-based fall detection methods have gradually received attention from researchers. Among them, the YOLO model has attracted attention for its efficient target detection ability. Several studies have employed YOLO for the detection of fall events. For example, Bo et al. [19] compared the performance of different YOLO models in fall detection. Kan et al. [20] combined group shuffle convolution to implement a lightweight YOLOv5 with enhanced detection abilities. However, these methods still face challenges in dealing with fall poses on different scales. To improve the detection accuracy, researchers improved the design of the feature extraction module. Fan et al. [21] considered multi-scale features in their design to better capture local features and contextual information when a person falls. Lyu et al. [22] enhanced the SPP network by adding an additional 1×1 max-pooling layer, achieving significant detection results. Additionally, to effectively extract multi-scale features, Abas et al. [23] proposed a CNN model combining multiple max-pooling layers and combined it with YOLO to achieve accurate target detection. This suggests that the introduction of multiple pooling layers is crucial for the effective extraction of multi-scale features of fall events. Furthermore, to distinguish between normal activities and fall events, the model needs to capture more detailed features more accurately. For this reason, researchers have attempted to enhance the performance of the YOLO model by introducing an attention mechanism. This is because the attention mechanism enables the model to focus more effectively on the key information in the image [24,25]. For example, Chen et al. [26] introduced a spatial attention module in the backbone network to extract more location features. Zhao et al. [27] improved the feature extraction ability for fall target detection by enhancing both the coordinate attention and shuffle attention mechanisms. Wang et al. [28] proposed adding squeeze-and-excitation networks to the last layer of the backbone network to further enhance feature extraction abilities. Given the features of fall movements, it is crucial to pay attention to both spatial and positional information. Therefore, introducing a multi-scale detection module and attention mechanism is of great significance for addressing the problems of varying scales and subtle pose features in fall detection.

In summary, although traditional fall detection methods have been automated to a certain extent, their application still faces many challenges due to the inconvenience of wearing devices and privacy concerns. Computer vision-based fall detection methods, especially the target detection method using the YOLO model, offer new insights for fall detection.

3. Methodology

Fig. 1 illustrates the structure of SCPE-YOLOv5s. The model consists of three parts: the Backbone, the Neck, and the Head. Firstly, the Backbone is the main feature extraction network of the model. It includes Focus, Conv-BatchNorm2d-Sigmoid-weighted Linear Unit (SiLU) (CBS), Cross Stage Partial (CSP), and Spatial Pyramid Pooling with Attention (SPPA) modules. Focus slices the input feature map, reducing its spatial resolution while expanding the number of channels. CBS is a combined module integrating convolution, batch normalization, and the SiLU activation function. CSP, a special network structure, is designed to extract and integrate the backbone features. SPPA, an improved version of SPP, efficiently enlarges the receptive field of the network by combining average pooling and max-pooling. Next, the Neck is the enhanced feature extraction network. It mainly consists of the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN). In the FPN, the effective feature layers that have been obtained are used to continue extracting features. The PAN fuses features from different layers through up-sampling and down-sampling operations. After each up-sampling step, an Efficient Channel and Spatial Attention (ECSA) network is introduced. This attention module allows the model to understand and analyze the image content more comprehensively. Finally, the Head is responsible for predicting the location and category of fall events.


Fig. 1

SCPE-YOLOv5s structure.

3.1. ECSA network

In fall scenarios, the effectiveness of feature detection can vary significantly due to environmental and other factors. Fall detection requires not only recognizing a person's pose and shape but also understanding the spatial distribution of these features. Spatial attention mechanisms allow the model to focus more accurately on areas where a fall is likely to occur. For example, if a person falls to the floor, the model can use the spatial attention mechanism to focus its attention on the person's pose on the floor while ignoring background regions that are not related to the fall event.

Fig. 2 shows the structure of the ECSA. Suppose the input feature map is denoted as F ∈ R^(C×W×H), where C represents the number of channels, H the height, and W the width. The output obtained after global average pooling is denoted as F′ ∈ R^(C×1×1). The kernel size of the one-dimensional convolution is adapted automatically as a function of the number of channels [11]. The kernel size k can be represented as shown in Equation (1):

k = | log2(C)/γ + a/γ |,    (1)

where γ equals 2 and a equals 1. A one-dimensional convolution is applied to the pooled output F′, and attention weights are then generated by a Sigmoid activation. The resulting output F″ can be expressed as shown in Equation (2):

F″ = Sigmoid(Conv1d(F′, ω)),    (2)

where Conv1d denotes a one-dimensional convolution operation, and ω is the weight of the convolution kernel. Following this, the channel attention-adjusted feature map F_channel is represented as shown in Equation (3):

F_channel = F ⊗ F″,    (3)

where ⊗ denotes element-by-element multiplication. Next, spatial attention is considered, and the maximum and average values of the channel attention-adjusted feature map are calculated along the channel dimension, as shown in Equations (4) and (5):

max_out = max(F_channel, dim = 1),    (4)

avg_out = mean(F_channel, dim = 1),    (5)

where max and mean represent maximization and averaging, respectively. The obtained results are then stacked along the channel dimension, giving the spatial attention expression shown in Equation (6):

F_spatial = Sigmoid(Conv2d(Concat(max_out, avg_out, dim = 1))),    (6)

where Conv2d denotes the two-dimensional convolution operation, and Concat denotes the stacking operation. Finally, the spatial attention weights are applied to the channel attention-adjusted feature map, and the final output feature map F_out is represented as shown in Equation (7):

F_out = F_channel ⊗ F_spatial.    (7)


Fig. 2

ECSA structure.
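
To make the data flow of Equations (1)-(7) concrete, the following is a minimal PyTorch sketch of an ECSA-style block. It is an illustrative reconstruction rather than the authors' released code: the class and variable names are ours, the 7×7 kernel of the spatial-attention convolution is an assumption (the paper does not state its size), and the adaptive kernel size of Equation (1) is rounded to the nearest odd integer as in the original ECA design.

```python
import math
import torch
import torch.nn as nn

class ECSA(nn.Module):
    """Sketch of the Efficient Channel and Spatial Attention block (Eqs. (1)-(7)).
    Channel attention follows ECA (1D convolution over the pooled channel vector);
    spatial attention stacks channel-wise max and mean maps as in Section 3.1."""
    def __init__(self, channels, gamma=2, a=1):
        super().__init__()
        # Eq. (1): kernel size adapted to the channel count, forced to be odd.
        k = int(abs(math.log2(channels) / gamma + a / gamma))
        k = k if k % 2 else k + 1
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.conv2d = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)  # 7x7 is assumed
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                       # x: (B, C, H, W)
        # Channel attention, Eqs. (1)-(3).
        y = x.mean(dim=(2, 3))                  # global average pooling -> (B, C)
        y = self.conv1d(y.unsqueeze(1))         # 1D conv across channels -> (B, 1, C)
        w_channel = self.sigmoid(y).squeeze(1).view(x.size(0), -1, 1, 1)
        f_channel = x * w_channel               # Eq. (3): element-wise re-weighting
        # Spatial attention, Eqs. (4)-(7).
        max_out, _ = f_channel.max(dim=1, keepdim=True)   # Eq. (4)
        avg_out = f_channel.mean(dim=1, keepdim=True)     # Eq. (5)
        w_spatial = self.sigmoid(self.conv2d(torch.cat([max_out, avg_out], dim=1)))  # Eq. (6)
        return f_channel * w_spatial            # Eq. (7)
```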

3.2. SPPA network

The SPP network is a technique for fusing features at different scales. It performs feature extraction through max-pooling operations with different kernel sizes to enlarge the receptive field of the network. However, the original SPP network has some issues. Firstly, it does not consider the importance of spatial and channel information, even though certain feature channels, such as shape and position, are important in fall scenes. Secondly, fall detection requires not only recognizing a person's pose but also understanding the environmental context, and average pooling enhances the model's grasp of that context. Additionally, capturing features at different scales is critical for fall detection because fall poses and background context can vary greatly; multi-scale features help the model recognize fall poses at various scales more accurately.

Fig. 3 (a) illustrates the original SPP [12] structure. Fig. 3 (b) illustrates the SPPA structure. Suppose the input feature map is denoted as X ∈ R^(C×W×H). The output X′ obtained after passing the input feature map through the CBS module is as shown in Equation (8):

X′ = ReLU(α · (Conv(X, ω) + b − δ) / √(ε²) + β),    (8)

where Conv(X, ω) denotes the convolution operation, ω is the weight of the convolution kernel, b is the bias of the convolution, α and β are learnable parameters, and δ and ε² are the batch mean and variance, respectively. The activation function can be represented as shown in Equation (9):

ReLU(x) = max(0, x).    (9)


Fig. 3

Network structure. (a) SPP. (b) SPPA.

The max-pooling operations are calculated as shown in Equation (10):

P_i = max_{w,h}(X′_{C:W:H}),  i = 1, 2, 3,    (10)

where X′_{C:W:H} denotes the local region over the width, height, and channel dimensions, and max denotes the max-pooling operation with the i-th kernel size. The average pooling operations are calculated as shown in Equation (11):

M_j = avg_{w,h}(X′_{C:W:H}),  j = 1, 2, 3,    (11)

where avg denotes the average pooling operation. The outputs of all pooling layers are then spliced together to form a new feature vector, denoted as P = [P_1, P_2, P_3, M_1, M_2, M_3]. The spliced vector is passed through another CBS module (Equation (8)) to obtain the feature map X″. Finally, the output obtained after the ECSA network can be expressed as shown in Equation (12):

X_out = F_out(X″),    (12)

where F_out denotes the ECSA operation defined in Equation (7).
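
The pooling-and-attention pipeline of Equations (8)-(12) can be sketched in PyTorch in the same way. This is again an illustrative reconstruction, not the authors' implementation: the pooling kernel sizes (5, 9, 13) and the halving of channels in the first CBS block follow common YOLOv5 SPP defaults and are assumptions, the activation is left configurable because Section 3 describes CBS with SiLU while Equation (9) writes ReLU, and the ECSA class from the previous sketch is assumed to be in scope.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv-BatchNorm-activation block (Eq. (8))."""
    def __init__(self, c_in, c_out, k=1, act=nn.SiLU):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = act()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPA(nn.Module):
    """Sketch of SPP with Attention: parallel max- and average-pooling branches
    (Eqs. (10)-(11)), splicing of P = [P1, P2, P3, M1, M2, M3], a second CBS
    projection, and the ECSA block applied to the result (Eq. (12))."""
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):   # kernel sizes are assumed
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden)
        self.max_pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)
        self.avg_pools = nn.ModuleList(
            nn.AvgPool2d(k, stride=1, padding=k // 2) for k in kernels)
        self.cv2 = CBS(c_hidden * 2 * len(kernels), c_out)
        self.ecsa = ECSA(c_out)                 # ECSA sketch from Section 3.1

    def forward(self, x):
        x = self.cv1(x)                             # Eq. (8)
        p = [pool(x) for pool in self.max_pools]    # Eq. (10)
        m = [pool(x) for pool in self.avg_pools]    # Eq. (11)
        spliced = torch.cat(p + m, dim=1)           # P = [P1, P2, P3, M1, M2, M3]
        return self.ecsa(self.cv2(spliced))         # Eq. (12)
```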

4. Experiments and results

4.1. Dataset

The dataset has a total of 1440 images, containing both normal and fall states. The total number of fall event labels is 1360. There are 1170 images in the training set, 130 images in the validation set, and 140 images in the test set. The training and validation sets are divided into a 9:1 ratio. To improve the reliability of the experimental results, this paper employs 10-fold cross-validation [29]. Specifically, the training data and the validation data are divided into ten sub-samples in total. For each validation, nine of these sub-samples are used for training and one for validation.
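
The split described above can be generated with a standard K-fold utility. The sketch below assumes a hypothetical folder holding the combined training and validation images; the paths and file layout are illustrative only.

```python
from pathlib import Path
from sklearn.model_selection import KFold

# Hypothetical layout: the 1,170 training and 130 validation images in one folder.
image_paths = sorted(Path("data/train_val/images").glob("*.jpg"))

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kfold.split(image_paths)):
    train_files = [image_paths[i] for i in train_idx]   # nine sub-samples for training
    val_files = [image_paths[i] for i in val_idx]       # one sub-sample for validation
    # ... write fold-specific image lists and train/evaluate the model on this split ...
    print(f"fold {fold}: {len(train_files)} train / {len(val_files)} val images")
```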

4.2. Experimental setting and training results

Software environment: the operating system is Windows 10, the deep learning framework is PyTorch, and the programming language is Python. Hardware environment: the CPU is an Intel Core i7, the GPU is an NVIDIA RTX 3060, the memory is 12 GB, and the CUDA version is 11.7.

To ensure the model's training effect and stability, the parameters are set as follows: the input image size is 640 × 640, the optimizer is Stochastic Gradient Descent (SGD), the initial learning rate is 0.01, the number of epochs is 400, and the batch size is 16. Fig. 4 shows the loss function during the training and validation process. As can be seen from the figure, the loss value decreases as training proceeds. At around epoch 80, the loss function begins to stabilize, indicating that the model is beginning to converge.

Fig. 4

Loss function curve.
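
For reference, the reported training settings can be gathered in one place. The dictionary below is only a summary of Section 4.2; the key names are illustrative and do not reproduce the official YOLOv5 hyperparameter file.

```python
# Training settings reported in Section 4.2 (key names are illustrative).
train_config = {
    "img_size": 640,       # input images resized to 640 x 640
    "optimizer": "SGD",    # Stochastic Gradient Descent
    "lr0": 0.01,           # initial learning rate
    "epochs": 400,
    "batch_size": 16,
    "device": "cuda:0",    # NVIDIA RTX 3060 in the reported setup
}
```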

4.3. Evaluation indicators

Assume TP denotes that positive samples are considered positive, FP denotes that negative samples are considered positive, and FN denotes that positive samples are considered negative. The formulas for precision and recall [22] are shown below as Equations (13), (14):

P = TP / (TP + FP),    (13)

R = TP / (TP + FN).    (14)

Average Precision (AP) [20] evaluates the algorithm's ability to balance precision and recall across different categories. AP is calculated as shown in Equation (15):

AP = Σ_n (R_n − R_{n−1}) · P_n,    (15)

where R_n and P_n are the recall and precision at the n-th confidence threshold, evaluated at an Intersection over Union threshold of 0.5. Fig. 5 displays the model's P–R curves during the training process. The mean Average Precision (mAP) [20] is the average of the AP values over all categories. mAP is calculated as shown in Equation (16):

mAP = (1/k) · Σ_{i=1}^{k} AP_i,    (16)

where k denotes the number of categories. Because there is only one label (fall event) in the dataset, k = 1. Higher mAP values indicate higher accuracy of the model in predicting fall events.


Fig. 5

P–R curve.
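
The sketch below shows how Equations (13)-(16) translate into code for the single fall class. The matching step (deciding which detections are true positives at an Intersection over Union threshold of 0.5) is assumed to have been performed beforehand, and the inputs at the bottom are toy values for illustration.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    """AP for one class over a ranked list of detections (Eqs. (13)-(15)).
    scores: detection confidences; is_true_positive: 1 if the detection matched
    a ground-truth box at IoU >= 0.5; num_gt: number of ground-truth fall boxes."""
    order = np.argsort(-np.asarray(scores))               # rank detections by confidence
    tp = np.asarray(is_true_positive, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    precision = cum_tp / (cum_tp + cum_fp)                # Eq. (13)
    recall = cum_tp / max(num_gt, 1)                      # Eq. (14)
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([1.0], precision))
    # Eq. (15): precision weighted by the increase in recall at each threshold.
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

# Eq. (16): with only the fall class (k = 1), mAP equals the class AP.
ap = average_precision(scores=[0.9, 0.8, 0.6, 0.4],
                       is_true_positive=[1, 1, 0, 1], num_gt=4)
print(f"mAP = {ap:.4f}")
```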

Frames Per Second (FPS) is used to evaluate the detection speed of a model. The higher the FPS, the faster the model's processing speed, indicating that it can complete object recognition in a shorter amount of time.
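
FPS can be estimated with a simple timing loop such as the sketch below; model and images are placeholders for the trained detector and its preprocessed input tensors, and warm-up passes are excluded so that one-time initialization does not distort the estimate.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, warmup=10):
    """Rough FPS estimate: number of images divided by total inference time."""
    model.eval()
    for img in images[:warmup]:          # warm-up passes, excluded from timing
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()         # ensure queued GPU work has finished
    start = time.time()
    for img in images:
        model(img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return len(images) / (time.time() - start)
```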

To verify the significance of the performance improvements of SCPE-YOLOv5s more rigorously over YOLOv5s, this paper employs the Wilcoxon test [30]. This test is a nonparametric statistical method for comparing the central tendencies of two related samples. The Wilcoxon test does not require the data to follow a normal distribution, making it particularly suitable for analyzing small samples or data that do not satisfy the conditions for normality.
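
As an illustration, the test can be run with SciPy on the per-fold mAP values reported later in Table 4. With ten paired folds, all differences positive and no ties, the exact two-sided test gives the statistic of 0 and p-value of about 0.002 reported in Table 5.

```python
from scipy.stats import wilcoxon

# Per-fold mAP (%) values from Table 4 (10-fold cross-validation).
yolov5s      = [83.56, 83.41, 84.42, 82.12, 82.89, 84.45, 82.50, 83.70, 83.82, 83.35]
scpe_yolov5s = [88.56, 86.74, 88.30, 89.27, 88.57, 87.89, 86.26, 88.85, 88.90, 89.52]

# Paired two-sided Wilcoxon signed-rank test on the fold-wise differences.
stat, p_value = wilcoxon(scpe_yolov5s, yolov5s)
print(f"Wilcoxon statistic = {stat}, p-value = {p_value:.3f}")
```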

4.4. Experimental results

4.4.1. Ablation experiment

In this paper, ablation experiments are conducted to assess the effect of various modules in SCPE-YOLOv5s on model performance. Fig. 6 demonstrates the changes in mAP for the four models during the training process. By comparison, it is found that SCPE-YOLOv5s performs best at epoch 320, indicating that the model achieved the best performance in the validation set after 320 rounds of training. During training, the mAP of the other three models also fluctuated; however, their highest mAP did not exceed that of SCPE-YOLOv5s.


Fig. 6

Trend plots of mAP for different models during training.

Table 2 presents the experimental results of the four models. It shows that SCPE-YOLOv5s achieves the best performance, with a mAP of 88.29%. Compared with YOLOv5s, the mAP of SCPE-YOLOv5s improved by 4.87%; compared with YOLOv5s+SPPA, it improved by 3.83%; and compared with YOLOv5s+ECSA, it improved by 2.73%. These data demonstrate the effectiveness of SCPE-YOLOv5s. Additionally, introducing either SPPA or ECSA alone into YOLOv5s increases the mAP, indicating that both networks contribute to performance. As shown in the table, the FPS of YOLOv5s is 62.6, highlighting its excellent processing speed. After introducing either the SPPA or the ECSA network alone, detection performance remains high, although there is a slight decrease in FPS. When both networks are applied simultaneously, the FPS of SCPE-YOLOv5s stabilizes at 57.4, showing that the accuracy gains come at only a modest cost in detection speed.

Table 2

Ablation results on a dataset.

| Model | SPPA | ECSA | mAP (%) | FPS (f/s) |
| YOLOv5s | | | 83.42 | 62.6 |
| YOLOv5s + SPPA | ✓ | | 84.46 | 60.8 |
| YOLOv5s + ECSA | | ✓ | 85.56 | 59.6 |
| SCPE-YOLOv5s | ✓ | ✓ | 88.29 | 57.4 |


To demonstrate the performance of the different models more intuitively, Fig. 7 shows the actual detection results of YOLOv5s and SCPE-YOLOv5s. From Fig. 7 (a), it can be seen that YOLOv5s occasionally fails to detect the target completely. In contrast, as shown in Fig. 7 (b), SCPE-YOLOv5s provides more comprehensive detection results. Furthermore, the confidence of the targets detected by YOLOv5s is generally low, significantly lower than that of SCPE-YOLOv5s. Heat maps were also generated; a heat map visually demonstrates how much attention the model pays to different regions. From the heat maps, SCPE-YOLOv5s exhibits stronger activation (a more distinct red color) in the detected regions, which is consistent with its higher confidence scores. In comparison, YOLOv5s, while also correctly labeling the target, exhibits lower color intensity, indicating that it is less confident in its predictions than SCPE-YOLOv5s.


Fig. 7

Detection results. (a) YOLOv5s. (b) SCPE-YOLOv5s.

4.4.2. Comparison experiment

To evaluate the performance of different models on the fall event dataset, You Only Look Once version 3 (YOLOv3) [31], You Only Look Once version 4 (YOLOv4) [32], Improved YOLOv5s [26], and SCPE-YOLOv5s are selected for comparative experiments. Among them, YOLOv3 and YOLOv4 are classical models in the field of target detection; Improved YOLOv5s further optimizes the feature extraction process by integrating asymmetric convolutional blocks and spatial attention mechanisms. The performance of these models in real applications can be more accurately evaluated by comparing their performance on the same dataset.

The experimental results of each comparison model are shown in Table 3. From the table, SCPE-YOLOv5s has the best performance on mAP, with an improvement of 6.53% compared to YOLOv3; 6.01% compared to YOLOv4; and 4.08% compared to Improved YOLOv5s. These data fully demonstrate the superior performance of SCPE-YOLOv5s on the fall event detection task.

Table 3

Comparison results on a dataset.

| Model | mAP (%) |
| YOLOv3 [31] | 81.76 |
| YOLOv4 [32] | 82.28 |
| Improved YOLOv5s [26] | 84.21 |
| SCPE-YOLOv5s | 88.29 |


Fig. 8 illustrates the detection results of the comparative models. Fig. 8 (a) shows that YOLOv3 accurately detects targets, whereas Fig. 8 (b) indicates that YOLOv4 makes detection errors in some cases. Fig. 8 (c) shows that the Improved YOLOv5s also detects targets accurately. It is particularly noteworthy that, as shown in Fig. 8 (d), SCPE-YOLOv5s detects targets with higher confidence than the other models. This higher confidence is also clearly visible in the heat maps, where the SCPE-YOLOv5s maps show stronger activation in the target regions.


Fig. 8

Detection results. (a) YOLOv3. (b) YOLOv4. (c) Improved YOLOv5s. (d) SCPE-YOLOv5s.

This paper compares the performance of YOLOv5s and SCPE-YOLOv5s on a dataset using 10-fold cross-validation. Table 4 illustrates the performance data of both models on the same dataset, showing that SCPE-YOLOv5s consistently achieves higher mAP than YOLOv5s. Furthermore, the results have been statistically analyzed using the Wilcoxon test. Table 5 demonstrates the Wilcoxon test results, with a statistic of 0 and a p-value of 0.002. This result supports the conclusion that the performance improvement of SCPE-YOLOv5s is statistically significant.

Table 4

10-fold cross-validation results.

| Model | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Fold 6 | Fold 7 | Fold 8 | Fold 9 | Fold 10 |
| YOLOv5s (mAP %) | 83.56 | 83.41 | 84.42 | 82.12 | 82.89 | 84.45 | 82.50 | 83.70 | 83.82 | 83.35 |
| SCPE-YOLOv5s (mAP %) | 88.56 | 86.74 | 88.30 | 89.27 | 88.57 | 87.89 | 86.26 | 88.85 | 88.90 | 89.52 |


Table 5

Wilcoxon test results on YOLOv5s and SCPE-YOLOv5s.

| Indicator | Value |
| Wilcoxon statistic | 0 |
| P-value | 0.002 |


5. Discussion

In daily life, early detection of fall events can significantly reduce the risk of injury. Therefore, this paper proposes a fall detection method based on SCPE-YOLOv5s, aiming to enhance the recognition abilities in complex fall scenarios. The main contribution of this paper is the development of enhanced ECA and SPP networks, integrated into YOLOv5s, which significantly improves the accuracy of fall detection.

The mechanism of the performance improvement is investigated through ablation experiments. The mAP of SCPE-YOLOv5s is 4.87% higher than that of YOLOv5s. This indicates that improving the model's feature extraction ability, particularly in capturing spatial and background information, can effectively improve fall detection performance. Compared to the traditional ECA, ECSA enhances the understanding of the spatial distribution of a person's pose by improving spatial attention in feature extraction. Compared with the traditional SPP, SPPA enhances the ability to capture background information in the fall environment by introducing average pooling layers. Additionally, the SPPA network combines max-pooling and average pooling operations, which helps the model extract multi-scale features of the fall event more easily. Because of the diversity of person poses and environments in a fall scene, the extraction of multi-scale features is critical for the model's recognition ability. The max-pooling operation allows the model to capture the larger features of the image, whereas the average pooling operation allows the model to extract the detailed information in the image. By combining these two pooling operations, the SPPA network can extract the features of the fall scene more comprehensively, thus improving the recognition performance of the model. Furthermore, this paper combines ECSA and SPPA. Experimental results show that, by strengthening the global and local feature extraction abilities, this combination improves the model's recognition ability in fall scenes. To demonstrate the model's performance more intuitively, the experiments include actual detection comparison charts and heat maps; these visualizations make the superiority of SCPE-YOLOv5s in the fall detection task clearer.

In the comparison experiments, this paper compares the proposed method with YOLOv3, YOLOv4, and improved YOLOv5s. The experimental results show that the proposed method has a significant advantage in mAP when compared to other methods. This indicates that the proposed method is more effective in the fall detection task. It is worth noting that although YOLOv3 and YOLOv4 show strong performance in the target detection task, they do not perform as well as YOLOv5s in the specific scenario of fall detection. This is primarily due to their limited ability to extract features and attention mechanisms, which are insufficient to accurately capture the nuances of fall behavior. Additionally, Improved YOLOv5s significantly improves the feature extraction performance by using asymmetric convolutional blocks and spatial attention mechanisms. These enhancements improve the model's ability to better understand the semantic information in the scene. However, in practice, the method still has limitations when it comes to dealing with the spatial distribution and background complexity that is specific to fall detection. The ability to capture the dynamic interaction between the background and the character during the fall process needs to be improved. Through comparison and analysis, we have discovered that SCPE-YOLOv5s is much better at understanding complex fall scenarios, which is attributed to the combined improvements of the ECSA and SPPA. These improvements not only enhance the spatial attention of the model, but also improve the ability to extract multi-scale features.

This paper clearly reveals the significant performance enhancement of SCPE-YOLOv5s compared to YOLOv5s through rigorous 10-fold cross-validation. This enhancement is fully reflected in the mAP statistics and is solidly supported by statistical analysis using the Wilcoxon test. Specifically, the Wilcoxon test yields a statistic of 0 with a p-value of 0.002. This result demonstrates that the advantages of the improved model are not only significant but also highly statistically reliable.

Even though SCPE-YOLOv5s has achieved good results in experiments, it still has some limitations in distinguishing between lying and falling events. The existing dataset may not fully cover all potential lying and falling postures, making it difficult for the model to distinguish between them when faced with new postures.

6. Conclusion

This paper proposes the SCPE-YOLOv5s model, which aims to enable the detection of fall events in daily life. By incorporating a spatial attention path into the ECA network, the model gains a better understanding of the spatial distribution of a person's pose in a fall scenario. Additionally, the improved ECA network is embedded into the up-sampling process of the enhanced feature extraction network, which significantly improves the local and global feature extraction abilities of the model. Furthermore, adding average pooling layers to the SPP network not only enhances the multi-scale feature extraction ability of the model but also improves its ability to capture background information. The model is validated on a fall event dataset. The ablation results show that SCPE-YOLOv5s improves the mAP by 4.87% over YOLOv5s. In addition, compared with other state-of-the-art algorithms, SCPE-YOLOv5s remains optimal.

Future work will further extend the dataset, investigate the model's ability to recognize different fall postures, and improve the accuracy and reliability of detection. Human detection boxes and pose estimation results will be used as additional inputs to the model to distinguish lying events from falling events.

Data availability statement

The data that support the findings of this study are available in [AI Studio] at https://aistudio.baidu.com/datasetdetail/94809/1.

CRediT authorship contribution statement

Hao Chen: Writing – review & editing, Writing – original draft, Visualization, Resources, Project administration, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Wenye Gu: Supervision, Software, Resources, Investigation, Formal analysis, Data curation, Conceptualization. Qiong Zhang: Writing – review & editing, Validation, Supervision, Resources, Investigation. Xiujing Li: Validation, Resources, Investigation, Formal analysis. Xiaojing Jiang: Visualization, Resources, Investigation, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was funded by Natural Science Research of Jiangsu Higher Education Institutions of China (Grant 23KJD520011) and Directive Projects of Nantong municipal science and technology plan (Grant MSZ2023175). The authors are thankful to all the personnel who either provided technical support or helped with data collection. We also acknowledge all the reviewers for their useful comments and suggestions.

References

1. Xu T., Zhou Y., Zhu J. New advances and challenges of fall detection systems: a survey. Appl. Sci. 2018;8(3):418.

2. Igual R., Medrano C., Plaza I. Challenges, issues and trends in fall detection systems. Biomed. Eng. Online. 2013;12(1):66.

3. Singh A., Rehman S.U., Yongchareon S., Chong P.H.J. Sensor technologies for fall detection systems: a review. IEEE Sensor. J. 2020;20(13):6889–6919.

4. Wang X., Ellul J., Azzopardi G. Elderly fall detection systems: a literature survey. Front. Robot. AI. 2020;7:71.

5. Mubashir M., Shao L., Seed L. A survey on fall detection: principles and approaches. Neurocomputing. 2013;100:144–152.

6. Zhang Y., Zheng X., Liang W., Zhang S., Yuan X. Visual surveillance for human fall detection in healthcare IoT. IEEE MultiMedia. 2022;29(1):36–46.

7. Tabata A.N., Zimmer A., dos Santos Coelho L., Mariani V.C. Analyzing CARLA's performance for 2D object detection and monocular depth estimation based on deep learning approaches. Expert Syst. Appl. 2023;227.

8. Tarimo S.A., Jang M.A., Ngasa E.E., Shin H.B., Shin H., Woo J. WBC YOLO-ViT: 2 Way-2 stage white blood cell detection and classification with a combination of YOLOv5 and vision transformer. Comput. Biol. Med. 2024;169.

9. Aly G.H., Marey M., El-Sayed S.A., Tolba M.F. YOLO based breast masses detection and classification in full-field digital mammograms. Comput. Methods Progr. Biomed. 2021;200.

10. Ajayi O.G., Ashi J., Guda B. Performance evaluation of YOLO v5 model for automatic crop and weed classification on UAV images. Smart Agric. Technol. 2023;5.

11. Wang Q., Wu B., Zhu P., Li P., Zuo W., Hu Q. ECA-Net: efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 11534–11542.

12. He K., Zhang X., Ren S., Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015;37(9):1904–1916.

13. Souza B.J., Stefenon S.F., Singh G., Freire R.Z. Hybrid-YOLO for classification of insulators defects in transmission lines based on UAV. Int. J. Electr. Power Energy Syst. 2023;148.

14. Kerdjidj O., Ramzan N., Ghanem K., Amira A., Chouireb F. Fall detection and human activity classification using wearable sensors and compressed sensing. J. Ambient Intell. Hum. Comput. 2020;11:349–361.

15. Chander H., Burch R.F., Talegaonkar P., Saucier D., Luczak T., Ball J.E., Prabhu R.K. Wearable stretch sensors for human movement monitoring and fall detection in ergonomics. Int. J. Environ. Res. Publ. Health. 2020;17(10):3554.

16. Alarifi A., Alwadain A. Killer heuristic optimized convolution neural network-based fall detection with wearable IoT sensor devices. Measurement. 2021;167.

17. Nooruddin S., Islam M.M., Sharna F.A. An IoT based device-type invariant fall detection system. Internet of Things. 2020;9.

18. Al Nahian M.J., Ghosh T., Al Banna M.H., Aseeri M.A., Uddin M.N., Ahmed M.R., Kaiser M.S. Towards an accelerometer-based elderly fall detection system using cross-disciplinary time series features. IEEE Access. 2021;9:39413–39431.

19. Bo L.U.O. Human fall detection for smart home caring using yolo networks. Int. J. Adv. Comput. Sci. Appl. 2023;14(4).

20. Kan X., Zhu S., Zhang Y., Qian C. A lightweight human fall detection network. Sensors. 2023;23(22):9069.

21. Fan X., Gong Q., Fan R., Qian J., Zhu J., Xin Y., Shi P. Substation personnel fall detection based on improved YOLOX. Electronics. 2023;12(20):4328.

22. Lyu L., Liu Y., Xu X., Yan P., Zhang J. EFP-YOLO: a quantitative detection algorithm for marine benthic organisms. Ocean Coast Manag. 2023;243.

23. Abas S.M., Abdulazeez A.M., Zeebaree D.Q. A YOLO and convolutional neural network for the detection and classification of leukocytes in leukemia. Indonesian J. Electr. Eng. Computer Sci. 2022;25(1):200–213.

24. Hu J., Shen L., Sun G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. pp. 7132–7141.

25. Woo S., Park J., Lee J.Y., Kweon I.S. CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV). 2018. pp. 3–19.

26. Chen T., Ding Z., Li B. Elderly fall detection based on improved YOLOv5s network. IEEE Access. 2022;10:91273–91282.

27. Zhao D., Song T., Gao J., Li D., Niu Y. YOLO-fall: a novel convolutional neural network model for fall detection in open spaces. IEEE Access. 2024;12:26137–26149.

28. Wang Y., Chi Z., Liu M., Li G., Ding S. High-performance lightweight fall detection with an improved YOLOv5s algorithm. Machines. 2023;11(8):818.

29. Moreno-Torres J.G., Sáez J.A., Herrera F. Study on the impact of partition-induced dataset shift on k-fold cross-validation. IEEE Transact. Neural Networks Learn. Syst. 2012;23(8):1304–1312.

30. Divine G., Norton H.J., Hunt R., Dienemann J. A review of analysis and sample size calculation considerations for Wilcoxon tests. Anesth. Analg. 2013;117(3):699–710.

31. Redmon J., Farhadi A. YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. 2018. http://arxiv.org/abs/1804.02767

32. Bochkovskiy A., Wang C.Y., Liao H.Y.M. YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. 2020. https://arxiv.org/abs/2004.10934

