key: cord-0587967-86thff1w
authors: Meng, Huina; Wu, Xilei; Wang, Xin; Fan, Yuhan; Shi, Jingang; Ding, Han; Wang, Fei
title: Mask Wearing Status Estimation with Smartwatches
date: 2022-05-12
journal: nan
DOI: nan
sha: f9db591a9ad82717e3dcca41ba7d462788c33a09
doc_id: 587967
cord_uid: 86thff1w

We present MaskReminder, an automatic mask-wearing status estimation system based on smartwatches, to remind users who may be exposed to COVID-19 virus transmission scenarios to wear a mask. With the powerful MLP-Mixer deep learning model, MaskReminder can effectively learn long- and short-range information from inertial measurement unit readings and recognize mask-related hand movements such as wearing a mask, lowering the metal strap of the mask, removing the strap from behind one side of the ears, etc. Extensive experiments on 20 volunteers and 8000+ data samples show that the average recognition accuracy is 89%. Moreover, MaskReminder can remind a user to wear a mask with a success rate of 90% even in the user-independent setting.

As the World Health Organization (WHO) has advised the public, wearing a mask is one of the simple yet critical precautions to suppress transmission and save lives during the COVID-19 pandemic [1]. The purposes of wearing a mask are basically to protect healthy people from becoming infected, or to prevent transmission from a wearer who is infected and may or may not have symptoms. Thus, many people should wear a mask in scenarios such as caring for cases of COVID-19, waiting in queues for COVID-19 tests, or having suggestive symptoms of COVID-19. Considering the importance of wearing masks, in this paper we propose a smartwatch-based mask-wearing reminder, named MaskReminder, for those involved in transmission scenarios such as, but not limited to, the above.

MaskReminder is feasible thanks to two clear observations. First, putting on a mask comprises a sequence of steps that distinguishes it from other activities such as walking, running, typing, etc. Second, once we have put on a mask, we tend to perform related activities, e.g., briefly removing the strap from behind one side of the ears to eat or breathe, lowering the top metal strap to expose the mouth for breath, pinching the metal strap occasionally to mold the shape of the nose, etc. We utilize the inertial measurement unit (IMU) of smartwatches, i.e., the accelerometer and gyroscope, to characterize the user's hand movements, enabling MaskReminder to estimate whether a user wears a mask or not.

Broadly, characterizing IMU readings is the well-known task of time-series recognition. The characteristics of a time series include local/short-range information such as shapelets, saltation, and trend, as well as global/long-range information such as extreme values, seasonal periods, and shape. Thus, our first challenge is to extract informative and distinguishable representations from both the local and global aspects of the accelerometer and gyroscope readings. We find that MLP-Mixer [2] is a suitable algorithm to overcome this challenge. In MaskReminder, we first cut a time series into several clips, then apply a multi-layer perceptron (MLP) to every individual clip, which learns intra-clip (local/short-range) features from the accelerometer and gyroscope data. To learn inter-clip (global/long-range) features, MLPs are applied to the feature space, dimension by dimension, across all clips. The intra-clip MLPs and inter-clip MLPs are applied cyclically to obtain more discriminative features for estimating the mask-wearing status of users.
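To make the two mixing directions concrete, here is a minimal sketch of how a single IMU window could be split into clips and mixed along each axis. The 128-sample window at 50 Hz and the 6 IMU channels come from the paper; the clip count of 8 and the hidden size of 64 are illustrative assumptions, not the authors' configuration.

```python
import torch

window = torch.randn(128, 6)       # 2.56 s of 3-axis accelerometer + 3-axis gyroscope readings at 50 Hz
clips = window.reshape(8, 16 * 6)  # cut into 8 clips, each flattened to a 96-dim vector

intra = torch.nn.Linear(96, 64)    # applied to each clip independently: intra-clip (local) features
inter = torch.nn.Linear(8, 8)      # applied across clips, per feature dimension: inter-clip (global) features

e = intra(clips)                   # (8, 64): per-clip embeddings
u = inter(e.transpose(0, 1))       # (64, 8): each feature dimension now mixes information from all clips
```

Alternating these two kinds of layers is the cyclic application described above.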
Other challenges arise from the diversity of hand movements. (1) Diversity in activity. MaskReminder estimates the mask-wearing status via hand activities such as wearing a mask, removing the strap from behind the ears, and lowering the metal strap. However, hand movements that interact with the face or head, e.g., rubbing the eyes, wearing glasses, and wearing a hat, may mislead MaskReminder. To overcome this problem, we collect data on 11 additional activities, as shown in Fig. 3, and train MaskReminder to resist being misled by these activities when estimating the mask-wearing status. (2) Diversity in duration. Activities last for different lengths of time; if estimation were based on 4.5-second recordings, the features of fast activities could be drowned out, leading to missed estimations. To solve this problem, we compute the average duration of all activities, as shown in Fig. 3, and set the estimation duration to 2.56 seconds (2.56 s × 50 Hz = 128 points). This setting acts as a crop on the activity of wearing a mask, allowing MaskReminder to detect mask wearing even without seeing the whole procedure. Notably, this setting also introduces extra sampling points into the recordings of fast activities, making MaskReminder resilient to the noise introduced by these extra points.

We recruit 20 participants to evaluate the performance of MaskReminder. We adopt MLP-Mixer to estimate the mask-wearing status and achieve 89% accuracy, outperforming both ResNet [3] and SVM. The contributions of this paper are threefold. (1) We propose MaskReminder, a smartwatch-based prototype, to estimate whether the user wears a mask, in order to remind those who may be involved in COVID-19 virus transmission scenarios. (2) We adapt MLP-Mixer and demonstrate its advantages in learning local and global information from time-series data. (3) We experiment on 20 participants and 8000+ data samples, showing that MaskReminder can accurately estimate the mask-wearing status in both the user-dependent and the user-independent evaluation.

MaskReminder intends to remind users who may be involved in COVID-19 virus transmission scenarios to wear a mask for self-protection. To build MaskReminder, we follow the standard supervised learning workflow, i.e., train MaskReminder with annotated data and then deploy it for use. In the training phase, we recruit volunteers to wear smartwatches while conducting the activities shown in Fig. 3, collect the recordings of the accelerometers and gyroscopes, annotate these recordings with the corresponding activity category, and train MaskReminder with the paired recordings and annotations. In the use phase, MaskReminder continuously reads the accelerometer and gyroscope recordings of the smartwatch and estimates the mask-wearing status over them. MaskReminder keeps silent if it detects that the user has worn a mask in the recent period; otherwise, it pops up a notification alerting the user to wear a mask. Fig. 2 demonstrates this workflow.

We now formulate the training of MaskReminder. After data collection and labeling, we have paired recordings (from the accelerometers and gyroscopes) and annotations, denoted as A, G, and y, respectively. Given N paired training samples, our goal is to propose and train a model W that minimizes the accumulated distance between its predictions and the annotations.
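One standard way to write this objective, using the CrossEntropy loss that the paper adopts and numbered (1) consistently with the later Eq. 2 and Eq. 3, is

W* = argmin_W Σ_{i=1}^{N} ℓ(W(A_i, G_i), y_i),    (1)

where ℓ denotes the CrossEntropy loss between the model's predicted activity distribution and the annotation y_i.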
In MaskReminder, we utilize an MLP-Mixer variant, a deep network, as W, as described below.

(1) Inputs. Recall that accelerometers and gyroscopes measure physical values in three dimensions. As shown in Fig. 4, the inputs of MLP-Mixer, A_i and G_i, are both of size L × 3, where L is the length of A_i and G_i. Following ViT [4], we divide A_i and G_i into c non-overlapping clips a_i^j ∈ R^{(L/c)×3} and g_i^j ∈ R^{(L/c)×3}, where a_i^j and g_i^j denote the j-th clip of A_i and G_i, respectively. All input recordings are raw data, i.e., without any pre-processing.

(2) Per-clip Fully-connected. Each pair a_i^j and g_i^j is concatenated along the second dimension to form x_i^j ∈ R^{(L/c)×6}. The per-clip fully-connected module is implemented via one fully-connected layer without bias that maps x_i^j → e_i^j ∈ R^h. This module works as a lightweight embedding function converting the inputs from the raw data space to a hidden feature space.

(3) Mixer Layer. The Mixer layer is the key component of MLP-Mixer; it efficiently mines the intra-clip (local) and inter-clip (global) representations of the inputs for hand activity recognition, as described below.

(4) Classification Head. The head consists of a global average pooling and a fully-connected layer. The input of the global average pooling is of size c × h*, where h* is the output dimension of the Mixer layers on each e_i^j. The global average pooling operates across the clip dimension, which further mixes features from all clips to gain a global view for hand activity classification and mask-wearing status estimation.

The Mixer layer serves as the key representation learner in the middle of MLP-Mixer, as shown in Fig. 4; we further illustrate its details in Fig. 5. Recall that the per-clip fully-connected module of MLP-Mixer maps each divided clip x_i^j → e_i^j ∈ R^h, so c clips lead to the Mixer layer input e_i ∈ R^{c×h}. The Mixer layer conducts {FC (fully-connected), GELU (Gaussian Error Linear Unit) [5], FC} across the columns and the rows of e_i to learn the global and local representations of the inputs, respectively.

(1) Inter-clip mixing for global representation. As shown in Fig. 5, we first apply Layer Normalization [6] on e_i and get LN(e_i). Then we transpose LN(e_i) to LN^T(e_i) ∈ R^{h×c} and conduct the {FC, GELU, FC} operations along every row of LN^T(e_i), i.e., every column of LN(e_i). Because the elements of every column of LN(e_i) comprise features from all clips of the whole recording, this sequence of operations mixes inter-clip features, learning a global representation of the recording. We represent these operations as

u_i = σ(LN^T(e_i) W_1) W_2,    (2)

where σ is GELU, a non-linear activation function; W_1 and W_2 are the parameters of the first and second fully-connected layers; and u_i ∈ R^{h×c} is the output.

(2) Intra-clip mixing for local representation. As shown in Fig. 5, once we have u_i, we first transpose it to u_i^T ∈ R^{c×h}. Then we apply Layer Normalization on u_i^T to get LN(u_i^T) and conduct the {FC, GELU, FC} operations along every row of LN(u_i^T). Because the elements of every row of LN(u_i^T) comprise features from one clip, this sequence of operations mixes intra-clip features, learning a local representation of every clip. We represent these operations as

v_i = σ(LN(u_i^T) W_3) W_4,    (3)

where σ is GELU; W_3 and W_4 are the parameters of the third and fourth fully-connected layers; and v_i ∈ R^{c×h} is the output.

One Mixer layer consists of the operations of Eq. 2 and Eq. 3, which conduct inter-clip mixing and intra-clip mixing for global and local representation learning, respectively.
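A minimal PyTorch sketch of one Mixer layer implementing Eq. 2 and Eq. 3 follows. The residual connections and the hidden expansion factor follow the original MLP-Mixer design [2] and are our assumptions rather than the paper's confirmed configuration.

```python
import torch
import torch.nn as nn


class MixerLayer(nn.Module):
    """One Mixer layer: inter-clip mixing (Eq. 2) then intra-clip mixing (Eq. 3)."""

    def __init__(self, c: int, h: int, expansion: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(h)
        self.inter = nn.Sequential(  # mixes across the c clips (global representation)
            nn.Linear(c, expansion * c), nn.GELU(), nn.Linear(expansion * c, c)
        )
        self.norm2 = nn.LayerNorm(h)
        self.intra = nn.Sequential(  # mixes within each clip's h features (local representation)
            nn.Linear(h, expansion * h), nn.GELU(), nn.Linear(expansion * h, h)
        )

    def forward(self, e):  # e: (batch, c, h)
        # Eq. 2: transpose so the {FC, GELU, FC} runs along the clip dimension.
        u = e + self.inter(self.norm1(e).transpose(1, 2)).transpose(1, 2)
        # Eq. 3: the {FC, GELU, FC} runs along the feature dimension of every clip.
        v = u + self.intra(self.norm2(u))
        return v
```

Stacking several such layers, averaging over the clip dimension, and adding a final linear classifier yields the overall network described above.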
As shown in Fig. 5, we can stack multiple Mixer layers for a more powerful representation and estimate the mask-wearing status with the CrossEntropy loss function. We show PyTorch-style pseudocode of MLP-Mixer in Algorithm 1.

We implement MLP-Mixer with PyTorch 1.10.2 and train it on a single RTX 3090. We train the network for a maximum of 400 epochs and set up an early-stopping mechanism with a patience of 40 epochs on the training loss. The initial learning rate is 0.0005, which decays by 0.5 every 40 epochs. We use AdamW (β1 = 0.9, β2 = 0.999) to optimize MLP-Mixer.

We recruit 20 volunteers, 7 men and 13 women, to conduct the actions shown in Fig. 3. These actions fall into three categories: (1) wearing a mask; (2) mask-wearing-related actions, e.g., adjusting the mask to ensure there is no gap at the face and nose, or pulling down the mask to breathe; and (3) actions that may mislead mask-wearing status estimation, e.g., rubbing the eyes or nose, or putting on or taking off a hat or earphones. Each volunteer repeats each action 20 times (10 times with the smartwatch on the right hand, 10 times on the left hand). We use a Samsung Gear Sport smartwatch to record the accelerometer and gyroscope readings at 50 Hz with corresponding timestamps. Meanwhile, we use a camera to record video streams with corresponding timestamps. The timestamps from the smartwatch and the camera are used to synchronize the IMU readings and the videos. We then replay the video streams to label the start and end time of each action repetition, with which we segment the IMU readings. In all, we obtain a dataset with 7200 segments of IMU readings (20 volunteers × 18 actions × 2 hands × 10 repeats).

Since these segments have different lengths, we normalize their length to 128 as follows (see the sketch below). (1) If the length L of a segment is less than 50, we discard the segment from the dataset. (2) If L ∈ [50, 128], we append 128 − L zeros to the segment. (3) If L > 128, we cut the segment into multiple clips of 128 sampling points each without overlap and apply (1) and (2) to the last clip. After this normalization, the dataset contains 8039 segments.
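A minimal sketch of this length-normalization rule (plain NumPy; the function name and shapes are ours, assuming 6-channel IMU segments):

```python
import numpy as np

def normalize_segment(seg: np.ndarray) -> list:
    """Normalize one IMU segment of shape (L, 6) to zero or more (128, 6) segments."""
    out = []
    # Rule (3): split long segments into non-overlapping 128-point clips.
    while len(seg) > 128:
        out.append(seg[:128])
        seg = seg[128:]
    # Rule (1): discard leftovers shorter than 50 points.
    if len(seg) < 50:
        return out
    # Rule (2): zero-pad segments of length 50..128 up to 128 points.
    pad = np.zeros((128 - len(seg), seg.shape[1]))
    out.append(np.vstack([seg, pad]))
    return out
```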
We evaluate MaskReminder on the collected dataset in a user-dependent manner and a user-independent manner, as follows.

(1) User-dependent. For the segments of each action of each volunteer, we divide them into 5 groups according to the order in which they were conducted, denoted G_1, G_2, G_3, G_4, G_5. We first train the MLP-Mixer on G_2-G_5 and test it on G_1; then we train on G_1 and G_3-G_5 and test on G_2; and so on. We apply this leave-one-group-out evaluation across all groups, obtain 5 trained models, and report their performance below.

• Table I shows that MLP-Mixer/MS/8 achieves an action recognition accuracy of 0.89, outperforming all other MLP-Mixer variants and demonstrating that MaskReminder performs well in estimating the mask-wearing status.

• Table I also indicates an over-fitting phenomenon: for example, the largest model (MLP-Mixer/S/8 with 278 MFLOPs) is inferior to a smaller one, e.g., MLP-Mixer/MS/8 with 36.03 MFLOPs. Further, in Fig. 6 we visualize the results in 3 groups according to parameter count, i.e., ES, MS, and S, which shows that models with more FLOPs perform better among models with similar parameter counts.

• Table I also shows that the MLP-Mixers largely outperform a traditional method, the Support Vector Machine (SVM), and a modern method, ResNet. SVM (statistic) uses statistical features of the IMU segments such as the average, skewness, kurtosis, entropy, etc., as in [7]. SVM (simple) uses the raw IMU readings as features. The ResNets replace the 2D convolutions and 2D pooling with 1D convolutions and 1D pooling that sweep along the time dimension of the IMU segments, inspired by [8].

• Fig. 7 shows the confusion matrix of the action recognition, which indicates that MaskReminder can accurately (≈0.90) recognize all actions. Most errors occur between two very similar actions, i.e., the 8th action (rubbing the eyes) and the 9th action (rubbing the nose).

• Recall that we recruit 20 volunteers to evaluate MaskReminder. We compute the action recognition accuracy for each volunteer. Fig. 8 shows that the average accuracy of mask-wearing status recognition is 0.89, and the highest reaches 0.97.

• Recall that the volunteers wear the smartwatch on the right hand and the left hand, respectively. We further evaluate the trained models on segments of the right hand and the left hand separately. The mean accuracy is 0.91 for the right hand and 0.87 for the left hand. This is likely because the right hand is the dominant hand for most volunteers, so IMU readings from the right hand provide more movement-related information.

(2) User-independent. In this manner, we train MLP-Mixer with the segments of 19 out of 20 volunteers and test the trained model with the segments of the remaining volunteer. We apply this leave-one-user-out evaluation across all volunteers, obtain 20 trained models, and show the results in Fig. 9. As Fig. 9 shows, when we train MLP-Mixer/MS/8 with the segments of the 2nd to 20th volunteers and evaluate it with the segments of the 1st volunteer, the accuracy is 0.52, which is not quite satisfactory. However, it is worth mentioning that MaskReminder is designed to report whether the user is wearing a mask. Thus, if a segment of an action that indicates the mask-wearing status, i.e., the 7th to 18th in Fig. 3, is classified as any of the 7th to 18th actions (even if not as the exact target action), MaskReminder can still remind successfully; we call this a success reminder. Similarly, if MaskReminder classifies a segment of the 1st to 6th actions in Fig. 3 as any of the 1st to 6th, it keeps successfully silent. This binary formulation, sketched below, largely relaxes the requirements on MaskReminder. Overall, the mean accuracy, the success-reminder rate, and the success-silence rate are 0.62, 0.90, and 0.89, respectively, indicating that MaskReminder is still a promising approach for estimating the mask-wearing status of unseen users in the user-independent setting.
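A minimal sketch of this relaxed binary decision rule (the class indices follow the description of Fig. 3 above; the names are ours):

```python
# Classes 1-6 mean MaskReminder should stay silent; classes 7-18 mean it should remind.
REMIND_CLASSES = set(range(7, 19))

def should_remind(predicted_class: int) -> bool:
    """Map an 18-way action prediction to the binary remind/silent decision."""
    return predicted_class in REMIND_CLASSES

def is_success(true_class: int, predicted_class: int) -> bool:
    # A prediction counts as a "success reminder" (or "success silence") whenever the
    # true and predicted actions fall on the same side of the 6/7 boundary, even if
    # the exact action label is wrong.
    return should_remind(true_class) == should_remind(predicted_class)
```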
Several works detect whether people are wearing masks using algorithms on images or videos [9]-[11], but these works can only be applied where cameras are deployed and thus cannot give users a timely reminder to wear masks and block the spread of the virus. As reported in [12], self-compliance with personal management regulations and consciously wearing masks are more effective in preventing the spread of infectious diseases than outside supervision. Therefore, [13] proposed a method to prevent infectious diseases by using a wristband with an inertial measurement unit (IMU) to detect whether the person is wearing a mask. However, wristband devices are rarely used and are neither aesthetically pleasing nor convenient, so we choose the smartwatch for the MaskReminder system.

Smartwatches already perform well in hand movement tracking [14], and in this context a series of studies on human activity recognition has emerged. The authors in [15] utilize machine learning approaches to recognize and track people's hand movements. SignSpeaker [16] is a prototype system that recognizes American Sign Language using a long short-term memory model. SmokeWatch [17] is a smartwatch application that uses sensors to recognize hand movements and help smokers who are willing to quit. In [18], the authors extract the magnitude of hand movements from smartwatch data and use a support vector machine (SVM) classifier to detect driver drowsiness. UWash [19] performs sample-wise handwashing gesture classification using a unified U-Net variant.

In this paper, we present a smartwatch-based mask reminder system, MaskReminder, which detects the user's hand movements via the built-in IMU sensors of smartwatches and estimates the mask-wearing status. MaskReminder adopts MLP-Mixer models to learn local and global information from time-series IMU readings. Extensive experimental results from 20 participants demonstrate that MaskReminder can accurately estimate the mask-wearing status in both the user-dependent and the user-independent evaluation. We envision MaskReminder expanding the functionality of smartwatches to help prevent virus transmission in people's daily lives during the current COVID-19 pandemic.

[2] MLP-Mixer: An all-MLP architecture for vision
[3] Deep residual learning for image recognition
[4] An image is worth 16x16 words: Transformers for image recognition at scale
[5] Gaussian error linear units (GELUs)
[6] Layer normalization
[7] A practical approach for recognizing eating moments with wrist-mounted inertial sensing
[8] Joint activity recognition and indoor localization with WiFi fingerprints
[9] Real-time mask identification for COVID-19: An edge-computing-based deep learning framework
[10] COVID-19 face mask detection using TensorFlow, Keras and OpenCV
[11] Mask or non-mask? Robust face mask detector via triplet-consistency representation learning
[12] Mask or no mask for COVID-19: A public health and market study
[13] Are you wearing a mask? Detecting if a person wears a mask using a wristband
[14] I am a smartwatch and I can track my user's arm
[15] Smartwatch-based early gesture detection & trajectory tracking for interactive gesture-driven applications
[16] SignSpeaker: A real-time, high-precision smartwatch-based sign language translator
[17] SmokeWatch: A smartwatch smoking cessation assistant
[18] Standalone wearable driver drowsiness detection system in a smartwatch
[19] You can wash better: Daily handwashing assessment with smartwatches