The Benefit of Distraction: Denoising Remote Vitals Measurements using Inverse Attention
Ewa Nowara, Daniel McDuff, Ashok Veeraraghavan
2020-10-14

ABSTRACT

Attention is a powerful concept in computer vision. End-to-end networks that learn to focus selectively on regions of an image or video often perform strongly. However, other image regions, while not necessarily containing the signal of interest, may contain useful context. We present an approach that exploits the idea that statistics of noise may be shared between the regions that contain the signal of interest and those that do not. Our technique uses the inverse of an attention mask to generate a noise estimate that is then used to denoise temporal observations. We apply this to the task of camera-based physiological measurement. A convolutional attention network is used to learn which regions of a video contain the physiological signal and generate a preliminary estimate. A noise estimate is obtained by using the pixel intensities in the inverse regions of the learned attention mask; this, in turn, is used to refine the estimate of the physiological signal. We perform experiments on two large benchmark datasets and show that this approach produces state-of-the-art results, increasing the signal-to-noise ratio by up to 5.8 dB, reducing heart rate and breathing rate estimation error by as much as 30%, recovering subtle pulse waveform dynamics, and generalizing from RGB to NIR videos without retraining.

1 INTRODUCTION

Attention mechanisms have been successfully applied in many areas of machine learning and computer vision (Mnih et al., 2014; Vaswani et al., 2017), including object detection (Oliva et al., 2003), activity recognition (Sharma et al., 2015), language tasks (Anderson et al., 2018; You et al., 2016), machine translation (Bahdanau et al., 2014), and camera-based physiological measurement (Chen & McDuff, 2018). An additional benefit of attention mechanisms is that they are interpretable and show which regions of an image were used to generate a particular output. In this paper, we focus on a counter-intuitive question: is there important information contained within the regions that are typically ignored by attention models? And can we exploit information in these regions to improve the quality of estimation for the underlying signals of interest?

We focus on the specific temporal prediction problem of camera-based physiological measurement as an exemplar application for our approach. The SARS-CoV-2 (COVID-19) pandemic has rapidly changed the face of healthcare, emphasizing the need for better technology to remotely provide care to patients. COVID-19 is linked to serious heart- and respiration-related symptoms (Xu et al., 2020; Zheng et al., 2020; Puntmann et al., 2020). Even after the COVID-19 crisis, many doctor appointments could be carried out online with telemedicine technology, increasing the flexibility of scheduling appointments. Recent research in computer vision has led to the development of non-contact physiological measurement techniques that leverage cameras and computer vision algorithms (Takano & Ohta, 2007; Verkruysse et al., 2008; Poh et al., 2010a; De Haan & Jeanne, 2013; Wang et al., 2017; Chen & McDuff, 2018).
Camera-based vital signs could also enable driver monitoring (Nowara et al., 2018), face anti-spoofing (Liu et al., 2020a; Nowara et al., 2017), or long-term human-computer interaction (HCI) studies (McDuff et al., 2016), where wearing contact devices for extended periods may be infeasible. Convolutional networks currently provide state-of-the-art performance on heart rate (HR) and breathing rate (BR) measurement from video (Chen & McDuff, 2018; Yu et al., 2019; Liu et al., 2020b). While convolutional neural networks may be able to accurately learn which features in the image are important for finding the physiological signals, they may not be able to learn a good model of the noise that corrupts the signals. The noise present in the video, which is considered to be "everything other than the signal of interest", may be caused by many diverse factors and can vary greatly across videos and datasets. Possible sources of noise include head motion (Estepp et al., 2014), facial expressions (Zhang et al., 2016), speech, ambient light variations (Nowara et al., 2018), and video compression artifacts (Yu et al., 2019; Nowara & McDuff, 2019). The wide variety of possible noise sources makes it challenging for any model to explicitly capture a good noise representation and to remove that noise from the signals of interest.

Figure 1: Skin pixels contain the strongest pulsatile signal, so a typical attention network will learn to focus on these pixels. However, they also capture motion and lighting information. We propose to use an inverse attention mask to capture motion and lighting changes as a noise estimate, and to use these to improve the pulse SNR and heart rate estimates. Our approach, using the regions ignored by the attention mechanism, produces more accurate physiological waveforms, even in severely challenging scenarios. (Plot axes: Time, Norm. Amp.)

The key observation we make is that regions ignored by an attention mechanism in a neural model likely contain information about sources of noise that are also present in the regions used by the attention mechanism to compute the physiological signals. Using the "distraction" regions that were ignored by the attention masks offers a way to estimate the noise for each video without making any assumptions about the nature of the noise. The only assumption is that most regions not used by the attention masks do not contain the signals of interest and consequently contain noise. We demonstrate that we can use the intensity variations from regions outside of the attention mask as a noise estimate and learn a denoising mapping to remove noise from the recovered signals. See Fig. 1 for an overview of our denoising approach.

We show that our approach outperforms state-of-the-art methods on several datasets across a range of HR and BR error measures. Our denoising approach also generalizes well to new data, even data recorded with different imaging modalities, such as near-infrared (NIR), without any additional training. Our proposed approach is also able to recover very subtle waveform dynamics, such as the clearly visible dicrotic notch shown in Fig. 2, which is challenging for video-based methods. Obtaining cleaner and more accurate waveforms is useful for determining important health metrics, such as blood pressure (Elgendi et al., 2019), which is infeasible with current methods. The idea of using the inverse attention regions is likely useful in a wide variety of vision tasks, ranging from activity recognition to deblurring.
However, in this work, we focus on physiological measurement due to its clinical importance. The core contributions of this paper are to: (1) propose the use of inverse attention masks for generating noise estimates, (2) present a novel method for denoising non-contact physiological measurements using this approach, (3) evaluate our method on three datasets, showing state-of-the-art performance on pulse and respiration measurement, and (4) demonstrate that our approach generalizes to NIR data without further training. Supplementary material including code, models, video examples, and additional experimental results is provided with this submission.

2 RELATED WORK

Attention Mechanisms. Attention mechanisms provide a way for a model to learn which parts of an image or video "are relevant for the task at hand and attach a higher importance to them" (Sharma et al., 2015). During training, attention weights are learned that reflect the importance of the embedding features. Recently, transformer models, based solely on attention mechanisms, have become popular (Vaswani et al., 2017). In convolutional neural networks (CNNs), these attention mechanisms typically form a spatial mask. These masks can help practitioners understand the decision-making process of a network (Fukui et al., 2019), and in certain cases the "fixations" of attention generated by computer models and by human observers were very similar (Oliva et al., 2003).

Figure 2: Pulse signals output by a state-of-the-art network and our denoising method. Our method produces cleaner signals, free from the motion artifacts present in the benchmark method, and better matching the subtle dynamics and shape of the ground truth. Notice the zoomed-in portions with easily identifiable dicrotic notch and diastolic peaks in our outputs.

Attention mechanisms can be used to connect layers; for example, one which focuses on temporal information (e.g., trained on flows) and another which focuses on spatial information (e.g., trained on RGB frames). Prior work has found that these cross-link layers guide the spatial stream to pay more attention to the human foreground areas and to be less affected by background clutter (Tran & Cheong, 2017). In physiological measurement, two-branch networks have been found to be effective, as both color and motion information are valuable for extracting the subtle physiological signal in the presence of noise (Chen & McDuff, 2018). While attention mechanisms often work well, they are a simple representation of which regions are important. However, pixels outside these regions may provide useful context or a strong prior about the noise present.

Physiological Imaging. Volumetric changes in blood over time lead to subtle changes in light reflected from the skin and subtle motion variations, which can be measured with a camera (Takano & Ohta, 2007; Verkruysse et al., 2008). The physiological signal obtained from a video can be used to recover several metrics and vital signs, including heart rate (Poh et al., 2010a), heart rate variability (Poh et al., 2010b), breathing rate (Poh et al., 2010b), blood oxygenation (Tarassenko et al., 2014), and pulse transit time (Shao et al., 2014). NIR (Nowara et al., 2018) and thermal cameras have also been successfully used for measuring physiological signals in the dark (Garbey et al., 2007; Pavlidis et al., 2016).
Unfortunately, the signals of interest in video-based physiological measurement are often very subtle and can be easily corrupted by noise from body motions and ambient lighting changes. Early work in physiological imaging used properties of the physiological signal, e.g., its periodic nature (Poh et al., 2010a) and hemoglobin absorption spectra (De Haan & Jeanne, 2013; Wang et al., 2017), to recover the underlying physiological signal via de-mixing methods. Others have used physical skin models to learn a mapping from color changes. Recently, several groups have demonstrated that deep learning models free from heuristic assumptions about the signal structure can perform better, especially in the presence of large motion and noise (Chen & McDuff, 2018; Zhan et al., 2019; Špetlík et al., 2018; McDuff, 2018; Niu et al., 2018). We show that the performance of a state-of-the-art model is significantly improved by using the distraction regions as a noise estimate.

Let us take a video of a person moving as an example. The skin pixels will contain information about the physiological signal, but they will also capture the body motion, as in each frame the incident light changes with the orientation of the head (see Fig. 1). In contrast, the hair pixels will not contain information about the physiological signal (as there are no blood vessels in the hair) but will still contain information about the motion. In this section, we explain how we use those inverse attention (or "distraction") regions to denoise the physiological signals. The details of the proposed deep learning architecture are provided in Fig. 4.

The backbone of the encoder is formed using a convolutional attention network (CAN) (Chen & McDuff, 2018). This contains appearance and motion branches learned jointly through an attention mechanism. The appearance model is trained directly on the input video frames. It learns from the color and texture information which regions in the video are likely to contain strong physiological signals. The motion model is trained on the difference of two consecutive video frames to differentiate the intensity variations caused by the characteristic physiological changes from those caused by other sources. The attention mask then reflects a heatmap of the strength of the pulsatile signal in each region of the frame. As shown in the first row of Fig. 3, the attention masks mostly focus on skin regions known to have strong physiological signals, while ignoring other regions, such as the eyes, hair, and background.

The CAN normally outputs a single one-dimensional (1D) physiological signal estimate. In addition, we perform an element-wise multiplication of the original input frame with the inverse of the attention mask weights to compute a secondary noise estimate. We compute the noise signals at each time step by multiplying the inverse attention masks with each channel of each video frame in an element-wise manner. We then spatially average the resulting weighted pixel intensities to obtain the noise estimate:

$$N_{c,t} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} M_t(i,j)\, I_{c,t}(i,j)$$

where $I_t$ and $M_t$ are the frame and the inverse attention mask at time $t$, $N_{c,t}$ is the noise estimate from each [R, G, B] camera channel $c$ at time $t$, and $H$ and $W$ are the image height and width, respectively. The attention and the inverse attention masks were 34 × 34 pixels, and the video frames were downsampled to the same size using bicubic interpolation. We normalize the attention mask elements to a range between 0 and 1. To obtain a noise estimate, we set all values larger than a fixed threshold, $T$, to 0 and everything else to 1, creating a binary mask; based on our experiments, a threshold of $T = 0.1$ worked well. This binary inverse attention mask ignores the regions in the video initially used to compute the physiological signals and keeps all other regions. Examples of inverse attention masks and the corresponding noise estimates are shown in the second row of Fig. 3.
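For illustration, the noise-estimate computation above can be sketched in a few lines of NumPy. This is a minimal rendering of the described steps (mask normalization, binary inverse mask with T = 0.1, element-wise weighting, and spatial averaging); the function and variable names are ours, not the released implementation.

```python
import numpy as np

def inverse_attention_noise(frame, attention, threshold=0.1):
    """Per-channel noise estimate from regions ignored by the attention mask.

    frame:     (H, W, 3) RGB frame, downsampled to the mask resolution (e.g., 34 x 34).
    attention: (H, W) attention mask produced by the CAN.
    threshold: attention values above this are treated as signal regions (T = 0.1).
    Returns N_t, a length-3 vector with one noise value per color channel.
    """
    # Normalize the attention weights to [0, 1].
    a = (attention - attention.min()) / (attention.max() - attention.min() + 1e-8)
    # Binary inverse mask: 1 where attention is low (ignored regions), 0 elsewhere.
    inverse_mask = (a <= threshold).astype(float)
    # Element-wise multiply each channel by the inverse mask, then spatially average.
    h, w = inverse_mask.shape
    return (frame * inverse_mask[..., None]).sum(axis=(0, 1)) / (h * w)
```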
Figure 4: Proposed denoising architecture. The encoder provides the initial physiological signal and the noise estimates to the LSTM at each time step, which outputs a denoised physiological signal.

Our denoising model is then formed as a long short-term memory (LSTM) network with the encoder providing input at each time step. The goal is to learn a denoising function to further clean the physiological estimates. As input to the denoising LSTM, we stacked the physiological signal and noise signal outputs generated by the encoder. The contact physiological signal (e.g., from a finger pulse oximeter) was used as ground truth for training. The noise estimates guide the LSTM to learn which waveform features are related to noise and which are related to the physiological signal of interest. The LSTM learns to suppress the noise in the physiological signal and outputs a cleaner waveform that better matches the ground truth physiological signal (see the third row of Fig. 3). See the video in the supplementary material for more examples of noise estimates and denoised signals. We used a two-layer bidirectional LSTM with 128 hidden units, trained for 10 epochs with the Adam optimizer (Kingma & Ba, 2014) and an MSE loss. Because the LSTM tends to work better on shorter sequences, we split each video into sequences of 60 samples, with 50% overlap between time windows, which corresponded to two seconds for the 30 frames per second (fps) videos. Physiological datasets are often relatively small due to the complexity associated with collecting carefully synchronized physiological signals and high-quality videos. Therefore, we implemented the CAN and the denoising LSTM as two separate networks to reduce the number of training parameters.
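As a reference, below is a minimal PyTorch sketch of a denoising model matching the description above (two-layer bidirectional LSTM, 128 hidden units, Adam optimizer, MSE loss against the contact ground truth). The class name and the exact input formatting are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class DenoisingLSTM(nn.Module):
    """Two-layer bidirectional LSTM mapping a stacked [signal, noise] sequence
    to a denoised physiological waveform (one value per time step)."""
    def __init__(self, n_noise_channels=3, hidden_size=128):
        super().__init__()
        # Input at each step: 1 physiological estimate + per-channel noise estimates.
        self.lstm = nn.LSTM(input_size=1 + n_noise_channels,
                            hidden_size=hidden_size,
                            num_layers=2,
                            bidirectional=True,
                            batch_first=True)
        self.head = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):                 # x: (batch, 60, 1 + n_noise_channels)
        h, _ = self.lstm(x)               # h: (batch, 60, 2 * hidden_size)
        return self.head(h).squeeze(-1)   # (batch, 60) denoised waveform

model = DenoisingLSTM()
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.MSELoss()   # trained against the contact ground-truth waveform
```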
Figure 5: Examples of images used to evaluate our proposed approach.

We evaluated our approach on the following two RGB video datasets and an NIR video dataset.

AFRL (Estepp et al., 2014): 300 videos of 25 participants were recorded as 658 × 492 pixel images at 120 fps. Fingertip reflectance photoplethysmograms (PPG), electrocardiograms (ECG), and respiratory signals were recorded as ground truth signals. We used the ECG signals to compute the HR estimation errors, the PPG signals to train the network for estimating HR, and the respiratory signals for computing the errors and training the network for BR estimation. Each participant was recorded 12 times in five-minute experiments with varying motion and two different backgrounds. The participants: 1) sat still and rested their chin on a headrest, 2) sat still without the headrest, 3) moved their head horizontally at a speed of 10 degrees/second, 4) 20 degrees/second, 5) 30 degrees/second, 6) reoriented their head randomly once every second. We center-cropped the AFRL video frames to 492 × 492 pixels to remove the blank background areas.

MMSE-HR (Zhang et al., 2016): 102 videos of 40 participants were recorded at 25 fps, capturing 1040 × 1392 resolution images during spontaneous emotion elicitation experiments. The ground truth blood pressure (BP) wave was measured at 1000 Hz, with an average HR updated after every heartbeat. We used the blood pressure waves to train the network and the average HR to compute the HR estimation errors. 19 videos had erroneous average HR estimates, so we recomputed them using the BP wave.

MR-NIRP (NIR) (Nowara et al., 2018): Eight participants were recorded with a NIR camera. The videos were recorded at 640 × 640 resolution and 30 fps. Fingertip transmission photoplethysmograms were recorded as ground truth signals. Each participant was recorded twice, once sitting still and once performing motion tasks involving talking and randomly moving the head. Because the background in MR-NIRP was not uniform, we applied face detection in the first video frame and cropped a rectangular region with 110% of the width and height of the detected bounding box.

Training the Encoder. Due to the large number of parameters, we pretrain the encoder on the largest dataset (AFRL (Estepp et al., 2014)) and lock the weights. When training the encoder, the loss is calculated as the mean squared error between the physiological estimate and the ground truth. We do not compute a loss on the noise estimate, as there is no ground truth noise signal. In our experiments, we performed training and testing separately for each of the six motion tasks from the AFRL dataset with participant-independent cross-validation, leaving out 20% of the participants in each validation split. For experiments on the MMSE-HR and MR-NIRP datasets, we used the model trained on Task 2, as these videos contained motions of the most similar amplitude. To maximize the diversity of participants seen by this model and improve its generalizability to new datasets, we trained it with a subject-dependent cross-validation instead, using four minutes of each video for training and one minute for testing.

Training the Denoising Model. When evaluating on the AFRL dataset, we trained the denoising model with the same subject-independent procedure as for the encoder on AFRL. The MMSE-HR dataset has fewer videos than the AFRL dataset; therefore, we used a leave-one-subject-out cross-validation, where we left out all videos of one subject and trained the model on all remaining videos, repeating this for each subject. The MR-NIRP dataset was too small for training the networks, so we used the LSTM trained on the AFRL dataset. This allowed us to test cross-dataset generalization.

We bandpass filtered ([0.7 Hz, 2.5 Hz]) and detrended the signals (Tarvainen et al., 2002). We normalized the signals by subtracting the temporal mean and dividing by the temporal standard deviation in each video, and normalized their amplitudes to the range [-1, 1]. We resampled all sequences to 30 fps. The signals from each video were divided into 30-second non-overlapping windows.

We evaluated the performance of our proposed denoising approach using the mean absolute error (MAE), root mean square error (RMSE), Pearson's correlation coefficient (ρ) between the estimated HR and the ground truth HR, the SNR of the estimated physiological signals (De Haan & Jeanne, 2013), and the waveform mean absolute error (WMAE) computed between the estimated and the ground truth signal. See the supplementary material for the definitions of the error metrics.
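For concreteness, here is a small SciPy sketch of the signal pre-processing described above (bandpass filtering, normalization, and resampling). The detrending step of Tarvainen et al. (2002) is omitted for brevity, and the helper name and exact order of operations are our assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, resample

def preprocess(signal, fs, low=0.7, high=2.5, target_fs=30.0):
    """Bandpass filter, normalize, and resample a physiological signal."""
    # Zero-phase Butterworth bandpass over the plausible heart-rate band.
    b, a = butter(3, [low / (fs / 2), high / (fs / 2)], btype="bandpass")
    x = filtfilt(b, a, signal)
    # Subtract the temporal mean, divide by the temporal standard deviation,
    # then scale the amplitudes to [-1, 1].
    x = (x - x.mean()) / x.std()
    x = x / np.abs(x).max()
    # Resample to the common 30 fps rate.
    return resample(x, int(len(x) * target_fs / fs))
```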
We compare four variants of our proposed approach to four state-of-the-art methods for recovering the pulse signal (Chen & McDuff, 2018; Poh et al., 2010a; De Haan & Jeanne, 2013; Wang et al., 2017) and two methods for recovering the breathing signal (Chen & McDuff, 2018; Tarassenko et al., 2014) (see the supplementary material for implementation and signal pre-processing details). The variants of our approach we compare are: training our model with noise estimates as input ("Distraction") and without noise estimates as input ("No Noise"), and subtracting the noise estimates from the physiological signal either in the frequency domain ("Freq. Sub.") or in the time domain ("Wave. Sub.").

Heart Rate Estimation. Our method achieves lower HR MAE, RMSE, and waveform MAE, and higher HR correlation (ρ) and SNR (see Table 1), compared to previous approaches on two large datasets. On the AFRL dataset, the MAE is reduced from 2.93 beats per minute (BPM) to 2.25 BPM (a 25% reduction in error), and on the MMSE-HR dataset, the MAE is reduced from 3.74 BPM to 2.50 BPM (a 33% reduction in error). This shows that information excluded by the attention mask can be successfully leveraged to remove noise, leading to substantial improvements in signal quality. Moreover, the proposed denoising approach is able to recover the subtle waveform dynamics, reducing the waveform MAE by more than 50% on MMSE-HR. While simply subtracting the noise from the signals in the frequency domain often improved the SNR, it did not improve the heart rate estimates. Subtracting the noise signal in the time domain performed even worse and had a particularly negative impact on the BVP SNR. All results were statistically significant (p < 0.01); see the supplementary material for F-test results.

Breathing Rate Estimation. In addition to estimating heart rate, which is based on intensity variations in the skin, our method can also be used to estimate the breathing rate (BR), which is based on motion variations and is more challenging in the presence of body motions. Only the AFRL dataset (Estepp et al., 2014) had gold-standard reference breathing signals; therefore, we were not able to evaluate our BR results on the other datasets. Our method achieves a reduction in MAE from 3.68 BPM to 2.44 BPM (a 34% error reduction) over the baselines and an increase in SNR of 5.87 dB (Table 1). Our method also obtains cleaner breathing signals compared to the baseline (Fig. 6).

True Benefit of Distraction Regions. Using our model without passing noise estimates from the output of the CAN to the LSTM works well when the signals do not change much over time and when the noise in the training and test sets is similar, e.g., when training and testing on AFRL (Table 1). However, including the distraction regions yields improvements in both HR and BR estimates when the signal varies over time or there is a large domain gap between the training and testing sets. For example, distraction regions improve performance on MMSE-HR, which has sudden pulse variations, uncontrolled motion, and the presence of facial expressions, and on the more challenging NIR MR-NIRP dataset (Table 1). Moreover, including the distraction regions improves the HR and BR estimation accuracy when we train our model only on stationary videos of AFRL (Task 1) and test on videos with large random motions (Task 6) (Table 2). The SNR is often higher in the "no noise" condition because it simply produces a smoother signal, leading to greater sparsity in the frequency domain. However, the dominant frequency of the signal (used to compute HR and BR) is often erroneous, resulting in higher MAE and RMSE and lower ρ. These results show that the distraction signal is useful above and beyond including a temporal component in the model.
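Since HR and BR are read off as the dominant spectral frequency of the recovered signal, a minimal sketch of that step may be helpful. This is our illustration of the standard procedure, not the authors' evaluation code; for BR, a correspondingly lower frequency band would be used.

```python
import numpy as np

def dominant_rate_bpm(signal, fs=30.0, lo_hz=0.7, hi_hz=2.5):
    """Estimate a rate (beats/breaths per minute) as the dominant
    frequency of the signal within the plausible physiological band."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    band = (freqs >= lo_hz) & (freqs <= hi_hz)   # restrict to the passband
    return 60.0 * freqs[band][np.argmax(power[band])]
```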
Transfer Learning. NIR videos from MR-NIRP are more challenging than RGB videos because the physiological signal is an order of magnitude weaker in the NIR range than in the visible range, making it very prone to motion artifacts. When trained solely on RGB videos (the AFRL dataset) without any fine-tuning, our method outperforms all the baselines across all five metrics on the NIR videos from the MR-NIRP dataset. As shown in Table 1, the MAE drops from 7.78 BPM to 2.34 BPM (a 70% reduction in error). The other baseline methods require multiple color channels and therefore cannot be applied to NIR videos.

Inverse Mask Definition. We tested computing the inverse attention mask in two different ways. The first was a matrix of continuous values in which each element of the inverse mask $M$, $M_{i,j}$, was $1 - A_{i,j}$, where $A$ is the attention mask with weights normalized from 0 to 1. The second approach was to threshold these values to create a binary mask, where $M_{i,j} = 0$ if $A_{i,j} > T$ and $M_{i,j} = 1$ otherwise, with $T$ a threshold between 0 and 1. We found that we obtain comparable results with binary (2.25 BPM) and continuous (2.10 BPM) inverse attention masks. We also found that the results were not very sensitive to the value of $T$ (see the supplementary material).

Different Distraction Regions. Certain regions in the video may contain more useful information about the sources of noise than others. For example, regions closer to the face may contain more information about the motion of the participant, while regions farther in the background may contain more information about other sources of noise, such as illumination changes. We compared separately using noise estimates from distraction regions closer to the face (center of the frames) and farther from the face (edges of the frames). When motion was small, all regions contributed similarly to denoising (MAE = 1.08 BPM with center regions and MAE = 1.07 BPM with edges). But when there was large head motion, regions close to the head (center of the frames) helped the most (MAE = 6.53 BPM with center regions and MAE = 8.74 BPM with edges). See the supplementary material for detailed results.

Effect of Glasses. Interestingly, our method performed best on subjects who wore glasses. The attention masks for subjects with and without glasses were comparably good. However, the CAN performed worse on subjects with glasses, and our approach offered a large improvement on those videos (MAE [BPM] with glasses: Ours = 2.17, CAN = 3.33; without glasses: Ours = 2.55, CAN = 2.57). See the supplementary material for example attention masks and results.

We have presented a novel approach for generating noise estimates from inverse attention masks to improve camera-based physiological signal measurements. We hypothesized that the noise affecting the regions used by the attention masks to compute the signal of interest would likely also be present in other regions of the video that are ignored by the attention masks. Our proposed denoising method outperformed all state-of-the-art methods in heart rate and breathing rate estimation from videos. The recovered BVP signals are also sufficient to recover subtle waveform dynamics present in the ground truth contact signals, including the dicrotic notch and the diastolic peak.
Our approach trained on RGB videos showed strong cross-dataset and cross-modality generalizability, outperforming the existing methods on challenging NIR videos.

SUPPLEMENTARY MATERIAL

To evaluate the performance of our proposed approach, we used the following four standard error measures (MAE, RMSE, Correlation, SNR), and we defined a new measure (Waveform MAE) to capture the waveform dynamics.

Mean Absolute Error (MAE):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| R_i - \hat{R}_i \right|$$

where $N$ is the total number of time windows, $R_i$ is the ground truth heart rate (HR) measured with a contact sensor for each 30-second time window, and $\hat{R}_i$ is the estimated HR from the video.

Root Mean Square Error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( R_i - \hat{R}_i \right)^2}$$

Correlation: Pearson's correlation coefficient (ρ) is computed between the estimated and ground truth HRs across all time windows.

Signal-to-Noise Ratio (SNR) (De Haan & Jeanne, 2013):

$$\mathrm{SNR} = 10 \log_{10} \left( \frac{\sum_{f} U_t(f)\, S(f)}{\sum_{f} \left( 1 - U_t(f) \right) S(f)} \right)$$

where $S$ is the power spectrum of the estimated iPPG signal, $f$ is the frequency in beats per minute (BPM), and $U_t(f)$ is equal to one for frequencies around the first and second harmonics of the ground truth HR (HR-6 BPM to HR+6 BPM and 2·HR-6 BPM to 2·HR+6 BPM) and zero everywhere else.

Waveform Mean Absolute Error (WMAE):

$$\mathrm{WMAE} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{L} \sum_{t=1}^{L} \left| W_i(t) - \hat{W}_i(t) \right|$$

where $W_i$ is the ground truth pulse waveform obtained with the contact sensor for each 30-second time window, $\hat{W}_i$ is the estimated pulse waveform from the video, and $L$ is the number of samples in each window.
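A small NumPy sketch of two of these metrics (the MAE and the SNR above) may clarify the harmonic template $U_t(f)$. This is our illustrative implementation, not the authors' evaluation code.

```python
import numpy as np

def hr_mae(hr_true, hr_est):
    """Mean absolute error between ground truth and estimated HR per window."""
    return np.mean(np.abs(np.asarray(hr_true) - np.asarray(hr_est)))

def pulse_snr_db(signal, hr_bpm, fs=30.0, tol_bpm=6.0):
    """SNR (dB): spectral power within +/- 6 BPM of the first and second
    HR harmonics versus the power everywhere else in the spectrum."""
    x = np.asarray(signal, dtype=float)
    x -= x.mean()                      # remove DC before computing the spectrum
    freqs_bpm = 60.0 * np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    u = ((np.abs(freqs_bpm - hr_bpm) <= tol_bpm) |
         (np.abs(freqs_bpm - 2.0 * hr_bpm) <= tol_bpm))   # template U_t(f)
    return 10.0 * np.log10(power[u].sum() / power[~u].sum())
```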
We compared the performance of our proposed approach to a state-of-the-art supervised method using a convolutional attention network (CAN) and three unsupervised methods described below. For the CHROM, ICA, and POS methods, face detection was first performed using MATLAB's face detector (vision.CascadeObjectDetector()). This was fixed for all methods to avoid the influence of the face detector on performance. For the CAN method, following the implementation in (Chen & McDuff, 2018), we did not use face detection but rather passed the full frame to the network after cropping the center portion to make the frame square (W = H).

CHROM (De Haan & Jeanne, 2013). This method uses a linear combination of the chrominance signals obtained from the RGB video. The $[x_R, x_G, x_B]$ signals are filtered using a zero-phase, 3rd-order Butterworth bandpass filter with pass-band frequencies of [0.7, 2.5] Hz. Following this, a moving window method of length 1.6 seconds (with overlapping windows and a step size of 0.8 seconds) is applied. Within each window, the color signals are normalized by dividing by their mean value to give $[\bar{x}_r, \bar{x}_g, \bar{x}_b]$. These signals are bandpass filtered using zero-phase forward and reverse 3rd-order Butterworth filters with pass-band frequencies of [0.7, 2.5] Hz. The filtered signals $[y_r, y_g, y_b]$ are then used to calculate $S_{win}$:

$$S_{win} = A - \alpha B$$

where $\alpha$ is the ratio of the standard deviations of the filtered versions of $A$ and $B$:

$$A = 3y_r - 2y_g, \qquad B = 1.5y_r + y_g - 1.5y_b$$

The resulting outputs are scaled using a Hanning window and summed with the subsequent window (with 50% overlap) to construct the final blood volume pulse (BVP) signal.

ICA (Poh et al., 2010a). This approach involves spatially averaging the pixels by color channel in the region of interest (ROI) for each frame to form time-varying signals $[x_R, x_G, x_B]$. Following this, the observation signals are detrended. A Z-transform is applied to each of the detrended signals. Independent Component Analysis (ICA) (JADE implementation) is then applied to the normalized color signals.

POS (Wang et al., 2017). The intensity signals $[x_R, x_G, x_B]$ are computed. A moving window of length 1.6 seconds (with overlapping windows and a step size of one frame) is applied. For each time window, the signal is divided by its mean to give $[\bar{x}_r, \bar{x}_g, \bar{x}_b]$. Following this, $X_s$ and $Y_s$ are calculated:

$$X_s = \bar{x}_g - \bar{x}_b, \qquad Y_s = -2\bar{x}_r + \bar{x}_g + \bar{x}_b$$

$X_s$ and $Y_s$ are then used to calculate $S_{win}$:

$$S_{win} = X_s + \alpha Y_s, \qquad \alpha = \frac{\sigma(X_s)}{\sigma(Y_s)}$$

The resulting outputs of the window-based analysis are used to construct the final BVP signal in an overlap-add fashion.

CAN (Chen & McDuff, 2018). A supervised convolutional attention network, described in detail in the main text. Following the implementation in that paper, we did not use face detection but rather passed the full frame to the network after cropping the center portion to make the frame square (W = H).

Signal Pre-processing. We bandpass filtered the physiological signals and noise estimates to the 0.7-2.5 Hz range and detrended them (Tarvainen et al., 2002) before feeding them into the LSTM. We set the detrending parameter λ for each dataset based on the video frame rate (λ = 500 for AFRL (Estepp et al., 2014), and λ = 50 for MMSE-HR (Zhang et al., 2016) and MR-NIRP (Nowara et al., 2018)). We normalized the signals and noise estimates with AC/DC normalization by subtracting the temporal mean and dividing by the temporal standard deviation computed for each video. We additionally normalized the amplitude range of the signals, noise estimates, and ground truth signals to [-1, 1]. Finally, we resampled all sequences to 30 fps.

Noise Signal Definition. We compared the performance of our proposed denoising framework with noise channels computed from a single red, green, or blue camera channel to using all three R, G, B channels. We hypothesized that the blue channel might be the best for the noise representation, because hemoglobin has the lowest absorption in the blue part of the light spectrum, so its intensity variations would be least related to blood flow. Conversely, the green channel could also be a useful noise representation, because it would contain information most similar to the physiological signals, since hemoglobin has the largest absorption in the green part of the spectrum. However, we found that there is not a large difference between using any one of the single channels or all three channels. We report the detailed results on the AFRL dataset (Estepp et al., 2014) in Table 4.

Inverse Mask Definition. We also compared computing noise using a binary and a continuous inverse attention mask. The continuous mask was computed as a matrix of continuous values in which each element of the inverse mask $M$, $M_{i,j}$, was $1 - A_{i,j}$, where $A$ is the attention mask with weights normalized from 0 to 1. The binary mask was computed by thresholding these values, where $M_{i,j} = 0$ if $A_{i,j} > T$ and $M_{i,j} = 1$ otherwise, with $T$ a threshold between 0 and 1. We found that we obtained comparable results with the binary and continuous masks, as shown in Table 4.

Different Distraction Regions. We compared separately using noise estimates from distraction regions closer to the face ("Center" of the frames) and farther from the face ("Edges" of the frames). We used an LSTM model trained on all ignored regions for this experiment. When motion was small, all regions contributed similarly to denoising. But when there was large head motion, regions close to the head (center of the frames) helped the most. See Table 5.

Effect of Glasses. We compared the performance of our denoising approach and the baseline CAN method on subjects with and without glasses. We found that our method offers the largest improvements on subjects with glasses, as shown in Table 6. However, the attention masks output by the CAN on subjects with and without glasses were comparable, as shown in Figure 7. Nine of the 25 subjects in the AFRL dataset were wearing glasses. No subjects in the MMSE-HR or MR-NIRP datasets were wearing glasses.
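To make the two inverse-mask variants compared above concrete, here is a short NumPy sketch; the helper name and the normalization are our assumptions for illustration.

```python
import numpy as np

def inverse_masks(attention, threshold=0.1):
    """Continuous and binary inverse attention masks, as compared above."""
    # Normalize the attention weights A to [0, 1].
    a = (attention - attention.min()) / (attention.max() - attention.min() + 1e-8)
    continuous = 1.0 - a                         # M_ij = 1 - A_ij
    binary = np.where(a > threshold, 0.0, 1.0)   # M_ij = 0 if A_ij > T, else 1
    return continuous, binary
```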
REFERENCES

Bottom-up and top-down attention for image captioning and visual question answering
Neural machine translation by jointly learning to align and translate
Deepphys: Video-based physiological measurement using convolutional attention networks
Estimating carotid pulse and breathing rate from near-infrared video of the neck
Robust pulse rate from chrominance-based rPPG
The use of photoplethysmography for assessing hypertension
Recovering pulse rate during motion artifact with a multi-imager array for non-contact imaging photoplethysmography
Attention branch network: Learning of attention mechanism for visual explanation
Contact-free measurement of cardiac pulse based on the analysis of thermal imagery
Adam: A method for stochastic optimization
Temporal similarity analysis of remote photoplethysmography for fast 3D mask face presentation attack detection
Multi-task temporal shift attention networks for on-device contactless vitals measurement
Deep super resolution for recovering physiological information from videos
CogCam: Contact-free measurement of cognitive stress during computer tasks with a digital camera
A fast non-contact imaging photoplethysmography method using a tissue-like model
Recurrent models of visual attention
SynRhythm: Learning a deep heart rate estimator from general to specific
Combating the impact of video compression on non-contact vital sign measurement using supervised learning
PPGSecure: Biometric presentation attack detection using photoplethysmograms
SparsePPG: Towards driver monitoring using camera-based vital signs estimation in near-infrared
Top-down control of visual attention in object detection
Dissecting driver behaviors under cognitive, emotional, sensorimotor, and mixed stressors
Non-contact, automated cardiac pulse measurements using video imaging and blind source separation
Advancements in noncontact, multiparameter physiological measurements using a webcam
Outcomes of cardiovascular magnetic resonance imaging in patients recently recovered from coronavirus disease 2019 (COVID-19)
Noncontact monitoring breathing pattern, exhalation flow rate and pulse transit time
Action recognition using visual attention
Visual heart rate estimation with convolutional neural network
Heart rate measurement based on a time-lapse image
Non-contact video-based vital sign monitoring using ambient light and auto-regressive models
An advanced detrending method with application to HRV analysis
Two-stream flow-guided convolutional attention networks for action recognition
Attention is all you need
Remote plethysmographic imaging using ambient light
Algorithmic principles of remote PPG
Pathological findings of COVID-19 associated with acute respiratory distress syndrome. The Lancet Respiratory Medicine
Image captioning with semantic attention
Remote heart rate measurement from highly compressed facial videos: An end-to-end deep learning solution with video enhancement
Analysis of CNN-based remote-PPG to understand limitations and sensitivities
Multimodal spontaneous emotion corpus for human behavior analysis
COVID-19 and the cardiovascular system