key: cord-0667502-m2usb0cc authors: Wang, Xin; Wu, Xilei; Meng, Huina; Fan, Yuhan; Shi, Jingang; Ding, Han; Wang, Fei title: Social Distancing Alert with Smartwatches date: 2022-05-12 journal: nan DOI: nan sha: 43f181b66e5006cc5cfd2fc84d7f46310b9bc95d doc_id: 667502 cord_uid: m2usb0cc

Social distancing is an efficient public health practice during the COVID-19 pandemic. However, people may unconsciously violate the social distancing practice when they conduct social activities such as handshaking, hugging, kissing on the face or forehead, etc. In this paper, we present SoDA, a social distancing practice violation alert system based on smartwatches, for preventing COVID-19 virus transmission. SoDA utilizes recordings of accelerometers and gyroscopes to recognize activities that may violate the social distancing practice with simple yet effective Vision Transformer models. Extensive experiments over 10 volunteers and 1800+ samples demonstrate that SoDA achieves social activity recognition with 94.7% accuracy, a 1.8% false alert rate, and a 2.2% missed alert rate.

Social distancing, maintaining approximately 6 feet or 2 meters from others, is an efficient public health practice during the COVID-19 pandemic that aims to prevent people who are infected from coming into close contact with healthy people, thereby reducing virus transmission. However, people may unconsciously violate the social distancing practice when conducting daily social activities such as shaking hands, hugging, kissing on the face or forehead, etc. To reduce this unconscious violation, we propose SoDA, which leverages the accelerometers and gyroscopes of smartwatches to recognize these social activities, serving as an alert system that promotes users' adherence to the social distancing practice.

Works on social distance estimation have flourished since the outbreak of COVID-19 [1]-[6]. For example, surveillance cameras are utilized to estimate the density of people for evaluating their adherence to the social distancing practice [1]. Some wireless communication technologies, such as Bluetooth Low Energy [2], Wi-Fi [3], and Ultra-Wide Band [4], are adopted to localize people indoors. Besides, acoustic sensors [5] and magnetic sensors [6] are used to estimate the distance between users who carry the sensory system. Nevertheless, we propose SoDA and highlight our motivations below.
• Systems based on cameras or wireless communication technologies are designed for business users, such as malls, schools, or metro stations. An individual cannot efficiently receive a specific alert when she/he violates the social distancing practice. SoDA works along with users and can alert each user specifically and in time.
• Systems based on acoustic sensors and magnetic sensors only estimate the distance between people who carry the sensory systems simultaneously. This strict cooperation seriously harms usability. SoDA only requires one user to wear it, and reports an alert for activities that would violate the social distancing practice.

*Equal contribution. #Corresponding author.

Fig. 1. Some social activities, e.g., (a-j), naturally violate the social distancing practice during the COVID-19 pandemic. We propose SoDA, an intelligent approach based on smartwatches, to distinguish these ten social activities from eight other daily activities, i.e., (k-r), serving as a violation alert that promotes users' adherence to the social distancing practice and reduces the transmission risk of the COVID-19 virus.
As shown in Fig. 1, SoDA, equipped with smartwatches, utilizes accelerometers and gyroscopes to characterize hand movements and recognize activities that would naturally violate the social distancing practice. Applying smartwatches to activity recognition has been demonstrated to be practical. For example, smartwatches have been designed to assess the tooth-brushing procedure against the Bass tooth-brushing technique [7], measure users' workouts [8], and evaluate handwashing techniques in accordance with the WHO guidelines [9].

In this paper, we apply Vision Transformer (ViT) [10], which divides an image into several patches as inputs and applies Transformer Encoder blocks [11] on the patches. Because Transformer Encoders compute the correlation between all patches, ViT can extract both short-range and long-range features from the very beginning, which is effective for classification. Inspired by this design, we evenly divide the time series of accelerometers and gyroscopes into several clips as inputs, performing no further processing on the raw data. We then feed these clips into ViT and obtain the classification results. If the result falls into the first 10 categories shown in Fig. 1 (a-j), SoDA reports the social distancing practice violation to the user to promote her/his awareness of keeping sufficient social distance from others. Otherwise, if the result falls into the last 8 categories shown in Fig. 1 (k-r), SoDA keeps silent.

To evaluate SoDA, we recruit 10 volunteers and ask them to repeat every action shown in Fig. 1 ten times, leading to a dataset with 1800 samples. We apply ViT to accomplish the action recognition task and achieve a mean accuracy of 94.7%. Compared with MLP-Mixer [12], ResNet [13], and Bi-LSTM models [14], ViT performs best. The main contributions of our work are as follows.
• We propose SoDA to detect users' activities that may violate the social distancing practice and remind them of the violation, promoting their awareness of the social distancing practice during the COVID-19 pandemic.
• We collect a dataset of 1800+ samples from 10 volunteers. Extensive experimental results over the dataset show that SoDA is effective in activity recognition and social distancing alert.
• The ablation study over five deep models demonstrates that ViT is a competitive approach. We release our code and envision its further use on more tasks and more modalities of time-series data.

The main workflow of SoDA is shown in Fig. 2. In data recording, accelerometers and gyroscopes continuously record wrist movements while the user wears a smartwatch. In data processing, SoDA processes the recorded time series and outputs the action category. In activity distinguishing and distancing alert, the output is compared against the eighteen activities. If it belongs to the social activities, SoDA reminds the user to keep sufficient distance from others. Otherwise, if it belongs to the daily activities, SoDA keeps silent.

Transformer [11] was originally proposed in natural language processing (NLP). Later, Vision Transformer (ViT) [10] opened up widespread applications of Transformer in computer vision. In this paper, we adopt ViT in SoDA for two reasons. (1) Simplicity. As shown in Fig. 3, ViT is a Transformer Encoder [11] plus a multi-layer perceptron head for classification. Besides, the raw data of accelerometers and gyroscopes are evenly divided into several clips and fed into a linear projection directly, without any pre-processing. (2) Ease of generalization. ViT has bridged the gap between NLP and computer vision. We expect its wide use in processing time-series data, and take a step toward demonstrating this possibility.

Model Inputs and Linear Projection. Given the input data of the accelerometers and gyroscopes $x_a \in \mathbb{R}^{L \times 3}$ and $x_g \in \mathbb{R}^{L \times 3}$, where $L$ is the data length and 3 is the sensory dimension, we first evenly divide $x_a$ and $x_g$ into $C$ clips and reshape each clip to $\mathbb{R}^{3L/C}$. Then we feed each clip into the Linear Projection to embed it with size $\mathbb{R}^h$. Next, we add the embedded accelerometer clip ($\mathbb{R}^h$), the embedded gyroscope clip ($\mathbb{R}^h$), and their position embedding (also $\mathbb{R}^h$) to obtain the embedding feature of size $\mathbb{R}^h$. Further, we concatenate the embedding features of all $C$ clips into $E \in \mathbb{R}^{C \times h}$. At last, as shown in Fig. 3, we concatenate $E$ with a classification token, i.e., the sum of a 0th-position embedding ($\mathbb{R}^h$) and a random clip embedding ($\mathbb{R}^h$), and obtain the concatenated embedding of size $\mathbb{R}^{(C+1) \times h}$ for the Transformer Encoder.
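To make this tokenization concrete, below is a minimal PyTorch sketch of the clip embedding described above. It is an illustrative reconstruction rather than the released implementation: the module name ClipEmbedding and the defaults (hidden=64, and num_clips=28, i.e., a clip length of 8 if the "/8" suffix in ViT-MS/8 denotes clip length) are our assumptions.

```python
import torch
import torch.nn as nn

class ClipEmbedding(nn.Module):
    """Embed accelerometer/gyroscope clips as ViT-style tokens (illustrative sketch)."""
    def __init__(self, seq_len=224, num_clips=28, hidden=64):
        super().__init__()
        clip_dim = 3 * seq_len // num_clips          # each clip is reshaped to R^{3L/C}
        self.num_clips = num_clips
        self.proj_a = nn.Linear(clip_dim, hidden)    # linear projection for accelerometer clips
        self.proj_g = nn.Linear(clip_dim, hidden)    # linear projection for gyroscope clips
        self.pos = nn.Parameter(torch.zeros(1, num_clips + 1, hidden))  # position embeddings
        self.cls = nn.Parameter(torch.randn(1, 1, hidden))              # random clip (class) token

    def forward(self, x_a, x_g):
        # x_a, x_g: (batch, L, 3) raw accelerometer / gyroscope series
        B = x_a.shape[0]
        xa = x_a.reshape(B, self.num_clips, -1)      # (B, C, 3L/C)
        xg = x_g.reshape(B, self.num_clips, -1)
        e = self.proj_a(xa) + self.proj_g(xg)        # sum the two modality embeddings, (B, C, h)
        e = torch.cat([self.cls.expand(B, -1, -1), e], dim=1)  # prepend class token -> (B, C+1, h)
        return e + self.pos                          # add position embeddings (0th slot + clip slots)
```

The output tensor of shape (batch, C+1, h) is exactly the concatenated embedding fed to the Transformer Encoder.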
Transformer Encoder. The Transformer Encoder is a powerful feature learner that can be stacked in sequence. As shown in the right subfigure of Fig. 3, one Transformer Encoder comprises a layer normalization (Norm) [15], a multi-head self-attention block (MSA), a multi-layer perceptron block (MLP), and two residual connections. The MLP consists of two linear layers with a GELU non-linearity [16] in the middle. The MSA is the key component of the Transformer Encoder, described next.

Multi-head Self-Attention (MSA). Recall that the inputs of the Transformer Encoder have size $\mathbb{R}^{(C+1) \times h}$; after the layer normalization, the input of MSA is mapped to $U \in \mathbb{R}^{(C+1) \times h}$. We first describe single-head self-attention (SSA). In SSA, a matrix $W_Q \in \mathbb{R}^{h \times h}$ maps $U$ to the Query matrix $Q \in \mathbb{R}^{(C+1) \times h}$. The other two parallel matrices, $W_K \in \mathbb{R}^{h \times h}$ and $W_V \in \mathbb{R}^{h \times h}$, map $U$ to the Key matrix $K \in \mathbb{R}^{(C+1) \times h}$ and the Value matrix $V \in \mathbb{R}^{(C+1) \times h}$, respectively. The SSA mechanism can be written as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{h}}\right)V, \qquad (1)$$

where $QK^{T} \in \mathbb{R}^{(C+1) \times (C+1)}$ computes the correlation matrix between any two clips, normalized by $\sqrt{h}$ and the softmax operator. The Value matrix $V$ is re-weighted by the normalized correlation matrix, i.e., $\mathrm{softmax}(QK^{T}/\sqrt{h})$, and then serves as the learned features of the inputs.

MSA is an advancement of SSA. In MSA with $k$ heads, the mechanism can be written as Equation 2:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_k)\,W_O, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V), \qquad (2)$$

where MSA first conducts the attention mechanism over every head with its own projection matrices $W_i^Q, W_i^K, W_i^V$. The attentions from all heads are then concatenated and merged by the matrix $W_O \in \mathbb{R}^{h \times h}$. The result of MSA, $\mathrm{MultiHead}(Q, K, V) \in \mathbb{R}^{(C+1) \times h}$, serves as the learned features for further processing.
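The MSA block in Equation 2 can be sketched in a few lines of PyTorch. The class and parameter names here (MSA, num_heads) are ours, and in practice torch.nn.MultiheadAttention provides an equivalent, optimized implementation; following [11], each head scales by the square root of its own dimension $h/k$.

```python
import math
import torch
import torch.nn as nn

class MSA(nn.Module):
    """Multi-head self-attention over (C+1) clip tokens (illustrative sketch)."""
    def __init__(self, hidden=64, num_heads=4):
        super().__init__()
        assert hidden % num_heads == 0
        self.h, self.k = hidden, num_heads
        self.w_q = nn.Linear(hidden, hidden, bias=False)  # W_Q
        self.w_k = nn.Linear(hidden, hidden, bias=False)  # W_K
        self.w_v = nn.Linear(hidden, hidden, bias=False)  # W_V
        self.w_o = nn.Linear(hidden, hidden, bias=False)  # W_O merges all heads

    def forward(self, u):
        # u: (batch, C+1, h) layer-normalized token embeddings
        B, N, _ = u.shape
        d = self.h // self.k
        # project, then split into k heads: (B, k, N, d)
        q = self.w_q(u).view(B, N, self.k, d).transpose(1, 2)
        k = self.w_k(u).view(B, N, self.k, d).transpose(1, 2)
        v = self.w_v(u).view(B, N, self.k, d).transpose(1, 2)
        # scaled dot-product attention per head: softmax(QK^T / sqrt(d)) V
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, self.h)  # concatenate heads
        return self.w_o(out)                                    # merge with W_O
```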
Model Outputs. The Transformer Encoder outputs features of size $(C+1) \times h$. As shown in Fig. 3, the successive MLP Head takes the first row of the outputs ($\mathbb{R}^h$) as input to classify the activity, e.g., handshake, hug, walk, etc. If the result falls into the first 10 categories shown in Fig. 1 (a-j), SoDA reminds the user to keep sufficient social distance from others. Otherwise, if the result falls into the last 8 categories shown in Fig. 1 (k-r), SoDA keeps silent.

We construct the model with PyTorch 1.10.2 and train it with an RTX 3090 GPU. Training runs for at most 400 epochs. An early stop mechanism saves training time by stopping the training process once the model has been trained for adequate epochs and the training loss has not decreased for 40 epochs. We use AdamW [17] ($\beta_1 = 0.9$, $\beta_2 = 0.999$) to optimize the model. The initial learning rate is 0.0005 and decays by a ratio of 0.5 every 40 epochs.

We recruit 10 subjects and let them wear a smartwatch, a Samsung Gear Sport, on the left wrist. Subjects are required to repeat each action shown in Fig. 1 ten times, where (a-f) are etiquette habits with physical contact, (g) is an etiquette habit at close distance, (h) and (i) are daily activities with physical contact, (j) is a daily activity with indirect contact, and (k-r) are other daily activities. While each activity is conducted, the data of accelerometers and gyroscopes, as well as the corresponding timestamps, are recorded, resulting in a dataset with 1800 samples.

In our experiments, we set the input length L = 224 for ViT. If a recorded sample is longer than 224, we slice it into multiple segments of 224 sampling points each, without overlapping. If the length of a recorded sample or a sliced segment is within [40, 224), we apply zero-padding to enlarge the length to 224. Otherwise, if the length is smaller than 40, we discard the sample. After these processes, the dataset contains 2061 samples.
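The following is a minimal NumPy sketch of this slicing/padding rule. The function name preprocess and the (T, 6) array layout for the six accelerometer/gyroscope axes are our assumptions for illustration.

```python
import numpy as np

def preprocess(sample, target_len=224, min_len=40):
    """Slice a (T, 6) recording into fixed-length segments per the rule above."""
    segments = []
    # non-overlapping slices of exactly target_len sampling points
    for start in range(0, len(sample) - target_len + 1, target_len):
        segments.append(sample[start:start + target_len])
    tail = sample[len(segments) * target_len:]
    # zero-pad the tail (or a short sample) if its length is in [min_len, target_len);
    # anything shorter than min_len is discarded
    if min_len <= len(tail) < target_len:
        pad = np.zeros((target_len - len(tail), sample.shape[1]))
        segments.append(np.concatenate([tail, pad], axis=0))
    return segments
```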
(1) Cross-Validation Result. Recall that we recruit 10 subjects to conduct 18 actions. For the i-th action of the j-th subject, we split the repeats into 5 groups according to the time they were conducted, denoted as $G^{i,j}_1, G^{i,j}_2, \ldots, G^{i,j}_5$. We apply this split strategy over all subjects and all actions, and obtain 5 non-overlapping groups of the dataset, denoted as $G_1, G_2, \ldots, G_5$. We first utilize $\{G_2, G_3, G_4, G_5\}$ as training data to train the ViT-MS/8 model (see Table II), leaving $G_1$ to test the trained model. As listed in Table I, the classification accuracy over $G_1$ is 0.914 (91.4%). We then utilize $\{G_3, G_4, G_5, G_1\}$ as training data, leaving $G_2$ for testing. We repeat this train-then-test procedure as listed in Table I and obtain a mean accuracy of 0.947 (94.7%) over this 5-fold cross-validation.

(2) Accuracy over subjects. We show the mean accuracy of the cross-validation procedure in a finer-grained view, i.e., accuracy over subjects. Denote the accuracy over the i-th subject as $a_i$, which can be computed as follows:

$$a_i = \frac{1}{|S_i|} \sum_{j=1}^{|S_i|} \mathbb{1}\left(p_i^j, gt_i^j\right),$$

where $|S_i|$ is the sample number of the i-th subject; $gt_i^j$ and $p_i^j$ are the ground truth and the prediction of the j-th sample of the i-th subject; and $\mathbb{1}(p_i^j, gt_i^j)$ returns 1 if $p_i^j$ equals $gt_i^j$, and 0 otherwise. As Fig. 4 shows, SoDA with ViT-MS/8 performs well over all subjects, especially over the 5th (0.995), 6th (0.990), 7th (0.981), and 8th (0.969) subjects. The mean accuracy over all subjects is 0.948.

(3) Performance over activities. We further show the confusion matrix over activities in Fig. 5, along with the number of samples, precision, and recall. As the figure shows, SoDA achieves good precision and recall over all activities, especially over the 5th (kiss on the forehead), 6th (bow), 16th (drink water), and 17th (keystroke). The most errors happen in the classification of the 0th activity (one-hand shake), because the smartwatch is not worn on the shaking hand. However, this error does not lead to much false silence (should alert but keeps silent). As shown in Fig. 5, most false silence happens in recognizing the 2nd activity (hug). Meanwhile, most false alerts (should be silent but alerts) happen in recognizing the 10th activity (walk). Overall, the rates of successful alert, successful silence, false alert, and false silence are 0.982, 0.978, 0.018, and 0.022, respectively.

(4) Ablation study on ViT variants. Model scale. The model scale depends on the number of Transformer Encoder blocks, hidden dimensions, self-attention heads, etc. As listed in Table II, we set the ViT model at 3 scales, i.e., ES (extra small), MS (medium small), and S (small). As shown in Fig. 6, when we expand the scale from ES to MS, the accuracy increases. However, the accuracy decreases when we expand the scale to S, which indicates that over-fitting happens.

Clip length. At each scale, the accuracy decreases when we lengthen the clips, as shown in Table II. This indicates that decreasing the clip length (increasing the clip number) yields a better representation and thus better accuracy. However, the better accuracy is traded for more FLOPs. In network design practice, this trade-off between accuracy and FLOPs deserves evaluation.

We compare the results of the ViT models with the MLP-Mixer [12], ResNet [13], and Bi-LSTM [14] models. The specifications and accuracy of these models are listed in Table II. To facilitate the understanding of these values, we plot accuracy against FLOPs in Fig. 6. The figure clearly shows that (1) the ViT models work better than the other models with similar FLOPs, while all large models face the problem of over-fitting; and (2) Bi-LSTM is a competitive method; however, training Bi-LSTM-S costs 11.3× the time of training ViT-MS/32.

To evaluate SoDA on unseen subjects, we adopt the leave-one-subject-out (LOSO) protocol. For example, we first use the data of the 2nd-10th subjects to train a ViT-MS/8 and use the data of the 1st subject to test the trained model. Then we use the data of the 3rd-10th and 1st subjects to train another ViT-MS/8 and use the data of the 2nd subject to test the trained model. We conduct this LOSO evaluation for all subjects in sequence and obtain 10 trained ViT models to be evaluated. The evaluation is reported in Fig. 7, which shows that the accuracies over unseen subjects are around 0.7-0.8, with a mean accuracy of 0.755. Further, we apply ViT-ES/8 and ViT-S/8 to the LOSO evaluation and list the accuracies in Table III. The table shows that the best performance over unseen subjects, 0.779, is achieved by the smallest model, ViT-ES/8. Promoting the performance over unseen subjects is valuable future work, and we release a dataset to facilitate research on this topic.
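A minimal sketch of the LOSO protocol using scikit-learn's LeaveOneGroupOut follows. The arrays here are random placeholders standing in for the real dataset, and the comment marks where a fresh ViT-MS/8 would actually be trained and tested in each fold.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Placeholder arrays standing in for the real dataset:
# X: (num_samples, 224, 6) sensor segments, y: 18-class activity labels,
# groups: subject id (0-9) for each sample.
X = np.random.randn(200, 224, 6)
y = np.random.randint(0, 18, size=200)
groups = np.repeat(np.arange(10), 20)

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(X, y, groups)):
    held_out = np.unique(groups[test_idx])[0]
    # A fresh ViT-MS/8 would be trained on X[train_idx], y[train_idx]
    # and evaluated on X[test_idx], y[test_idx] here.
    print(f"fold {fold}: train on subjects != {held_out}, test on subject {held_out}")
```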
In social distance detection, cameras are commonly used to collect videos of monitored areas [1]. These systems measure the distance or density of people in the videos and evaluate whether their social distance complies with regulations. This is a very intuitive method; however, it suffers from viewing-angle occlusion, so occluded people may not be detected correctly, and cameras installed in private places may expose personal privacy. Positioning technologies, such as Bluetooth Low Energy (BLE) [2], Wi-Fi [3], and Ultra-Wide Band (UWB) [4], are also adopted to detect social distance, but most of these methods target indoor activities. Various sensors are also exploited to estimate the social distance between people, such as thermal sensors [18], vibration sensors [19], and magnetic field sensors [6].

As a wearable device equipped with multiple sensors, the smartwatch has many applications in the field of healthcare. Combined with deep learning algorithms, it can use data from the accelerometer and gyroscope to perform action recognition, e.g., detecting falls of the elderly [20], where a bi-directional long short-term memory (Bi-LSTM) neural network is adopted to distinguish falls from common daily activities. Besides, smartwatches have a variety of applications in sleep, such as capturing rich information about sleep quality [21], detecting early Parkinson's disease through sleep [22], detecting sleep apnea [23], and monitoring breathing rate and body movement during sleep [24]. These studies enable people to obtain much health-related information from sleep with merely a single smartwatch.

In this paper, we present SoDA, a smartwatch-based solution that detects users' activities that may violate the social distancing practice, reminding people of the practice during the current COVID-19 pandemic to prevent virus transmission. To evaluate SoDA, we recruit 10 volunteers and build a dataset with 1800+ samples. Experimental results show that SoDA with simple ViT models efficiently distinguishes 10 social activities from daily activities with promising performance. It deserves mention that we show ViT generalizes well from handling image data to accelerometer and gyroscope data, with great potential to generalize to more modalities of time-series data.

REFERENCES
[1] Auto-SDA: Automated video-based social distancing analyzer.
[2] ProxiTrak: A robust solution to enforce real-time social distancing & contact tracing in enterprise scenario.
[3] CrowdTracing: Overcrowding clustering and detection system for social distancing.
[4] Social distance alert system to control virus spread using UWB RTLS in corporate environments.
[5] SmartDistance: A mobile-based positioning system for automatically monitoring social distance.
[6] A wearable magnetic field based proximity sensing system for monitoring COVID-19 social distancing.
[7] Toothbrushing monitoring using wrist watch.
[8] FitCoach: Virtual fitness coach empowered by wearable mobile devices.
[9] You can wash better: Daily handwashing assessment with smartwatches.
[10] An image is worth 16x16 words: Transformers for image recognition at scale.
[11] Attention is all you need.
[12] MLP-Mixer: An all-MLP architecture for vision.
[13] Deep residual learning for image recognition.
[14] Framewise phoneme classification with bidirectional LSTM and other neural network architectures.
[15] Layer normalization.
[16] Gaussian error linear units (GELUs).
[17] Decoupled weight decay regularization.
[18] A novel privacy-preserving approach for physical distancing measurement using thermal sensor array.
[19] Social distancing compliance monitoring for COVID-19 recovery through footstep-induced floor vibrations.
[20] Deep learning based fall detection using smartwatches for healthcare applications.
[21] SleepGuard: Capturing rich sleep information using smartwatch sensing data.
[22] Smartwatch-based activity analysis during sleep for early Parkinson's disease detection.
[23] ApneaDetector: Detecting sleep apnea with smartwatches.
[24] SleepMonitor: Monitoring respiratory rate and body position during sleep using smartwatch.