title: Dual Attention Network for Heart Rate and Respiratory Rate Estimation
authors: Ren, Yuzhuo; Syrnyk, Braeden; Avadhanam, Niranjan
date: 2021-10-31

Heart rate and respiratory rate measurement is a vital step in diagnosing many diseases. Non-contact camera-based physiological measurement is more accessible and convenient for telehealth than contact instruments such as fingertip oximeters, since non-contact methods reduce the risk of infection. However, remote physiological signal measurement is challenging due to environment illumination variations, head motion, facial expression, etc. It is also desirable to have a unified network that estimates both heart rate and respiratory rate, to reduce system complexity and latency. We propose a convolutional neural network that leverages both spatial attention and channel attention, which we call the dual attention network (DAN), to jointly estimate heart rate and respiratory rate from camera video. Extensive experiments demonstrate that our proposed system significantly improves heart rate and respiratory rate measurement accuracy.

Non-contact camera-based physiological measurement is a fast-growing research field that has drawn significant attention, especially during the COVID-19 pandemic. It reduces infection risk and enables telehealth, remote health monitoring and smart hospitals [1]. The underlying principle of camera-based physiological measurement is capturing subtle skin color changes [2] or subtle motions [3] caused by blood circulation. Imaging techniques can measure volumetric changes of blood at the surface of the skin by capturing these subtle color and motion changes. Imaging photoplethysmography (iPPG) is based on measuring subtle changes in light reflected from the skin. Imaging ballistocardiography (iBCG) is based on measuring the mechanical force of blood being pumped around the body, which causes subtle motions. Both heart rate and respiratory rate can be recovered using iPPG or iBCG based methods [4]-[7].

[Fig. 1: Comparison of the previously proposed TS-CAN [8] with our TS-DAN. TS-DAN leverages both spatial attention and channel-wise attention. Each of these models can be applied in a single or multi-task fashion.]

Camera-based heart rate and respiratory rate estimation is challenging because the skin color changes and motions caused by blood circulation are so subtle that they are easily corrupted by environment illumination variations, head motion, facial expression, etc. Traditional computer vision based heart rate and respiratory rate estimation involves several components to obtain robust measurements, such as face tracking [9], skin segmentation [10], heart rate or respiratory rate frequency band filtering [11], principal component analysis (PCA) [12], etc. Various algorithms have been proposed to improve robustness under each challenging scenario, such as environment illumination change, head motion and facial expression. Recently proposed convolutional neural networks enable end-to-end learning of heart rate and respiratory rate and leverage large-scale data, greatly outperforming traditional hand-crafted feature based methods, especially in challenging cases.
Several recent approaches have shown the benefit of enhancing spatial representations via spatial attention and channel-wise representations via channel-wise attention to boost the representational power of convolutional neural networks in various research fields. Fig. 1 shows our proposed temporal shift dual attention network (TS-DAN) for estimating heart rate and respiratory rate, compared to the previously proposed temporal shift convolutional attention network (TS-CAN). Each network can estimate heart rate and respiratory rate individually or jointly in a multitask fashion. In our work, we leverage both spatial attention and channel-wise attention to improve convolutional neural networks for heart rate and respiratory rate estimation. We summarize our contributions as follows:

• We apply attention mechanisms in both the spatial domain and the channel-wise domain to improve network accuracy. Spatial attention enhances spatial encoding by locating facial regions that contain a strong physiological signal response. Channel-wise attention recalibrates channel-wise feature responses to select the most informative features.

• We present a multi-task temporal shift dual attention convolutional network (MT-TS-DAN) for joint heart rate and respiratory rate measurement.

• To the best of our knowledge, our proposed network is the first to leverage both spatial attention and channel-wise attention; existing networks for heart rate and respiratory rate estimation do not include both in their architecture.

Traditional Hand-crafted Feature Method. De Haan et al. [13] proposed the CHROM method, which leverages light absorption differences among the R, G and B channels to reduce noise. Wang et al. [7] improved the motion robustness of CHROM by using spatially redundant pixel sensors of a camera, and leveraged artifacts as additional input channels to discriminate pulse from distortions [14]. Wang et al. [11] proposed sub-band pulse extraction to suppress periodic motions, particularly improving heart rate estimation robustness in fitness scenarios. Lewandowska et al. [12] proposed channel selection and a PCA algorithm to separate the heart rate signal from noise. RGB-IR sensors have also been proposed for physiological signal estimation, leveraging the additional IR channel to improve signal robustness and reduce noise [15].

Convolutional Neural Network Method. Recent CNN based solutions greatly improve the robustness of camera-based physiological measurement. Chen et al. [5] proposed a two-branch network with a spatial attention mechanism that takes the difference of two consecutive frames' face crops as a motion map and the current frame's face crop as an appearance map, improving accuracy under head motion. Liu et al. [8] further improved Chen et al. [5]'s network by adding a temporal shift module [16] and multitask learning for heart rate and respiratory rate estimation. However, their multitask learning decreases heart rate and respiratory rate accuracy because the same network is used to learn both signals. Niu et al. [17] proposed a spatial-temporal map with a deeper backbone network for end-to-end heart rate estimation. Several CNNs have been proposed to estimate heart rate from highly compressed videos [18], [19].

Temporal Module. Since heart rate and respiratory rate are estimated over a time window, temporal processing is crucial for improving accuracy over frame-based methods.
A 3D convolution based method was proposed in [8] to replace the 2D convolutions in Chen et al. [5]'s architecture; although it gives better accuracy than the 2D convolution module, its complexity is much higher. To reduce the complexity of 3D convolution, Lin et al. [16] proposed the temporal shift (TS) module, which shifts part of the channels along the temporal dimension to facilitate information exchange among neighboring frames. The TS module achieves the accuracy of a 3D CNN while maintaining the complexity of a 2D CNN, and it can be inserted into 2D CNNs to achieve temporal modeling (a minimal sketch of this operation is given below). The TS module has been widely used in video understanding [20], gesture recognition [21] and activity recognition [22].

Attention Models. Spatial attention [23] captures spatial relationships between features, and channel-wise attention [24] recalibrates channel-wise feature responses to select the most informative features. Spatial attention and channel-wise attention modules have greatly improved accuracy in various computer vision tasks, such as image classification [24] and segmentation [25]. The Squeeze-and-Excitation Network (SEN) [24] significantly improves image classification accuracy by introducing channel-wise attention modules. Fu et al. [25] append position attention and channel attention on top of a Fully Convolutional Network (FCN) [26] to improve image segmentation. Wang et al. [27] proposed the Efficient Channel Attention (ECA) module to improve the efficiency of SEN. ECA avoids dimensionality reduction and adds cross-channel interaction, which preserves performance while significantly decreasing model complexity.

For the theoretical optical principle of the model, we follow Shafer's dichromatic reflection model (DRM) [9] to model light reflection and physiological signals. The RGB value of the k-th skin pixel in an image can be defined by a time-varying function [5], [8]:

C_k(t) = I(t) \cdot (v_s(t) + v_d(t)) + v_n(t)   (1)

where C_k(t) denotes a vector of the RGB values; I(t) is the illuminance intensity; v_s(t) and v_d(t) are the specular and diffuse reflections, respectively; and v_n(t) denotes the camera sensor's quantization noise. I(t), v_s(t) and v_d(t) can each be decomposed into a stationary part (i.e., I_0, u_s \cdot s_0, u_d \cdot d_0) and a time-varying part (i.e., I_0 \cdot \Psi(\cdot), u_s \cdot \Phi(\cdot), u_p \cdot \Theta(\cdot)) [9], where m(t) denotes all non-physiological variations such as illumination variations from the light source, head motion and facial expressions; \Theta(b(t), r(t)) denotes the time-varying physiological signal, which combines both pulse b(t) and respiration r(t) information; \Psi(\cdot) denotes the intensity variation observed by the camera; \Phi(\cdot) denotes the varying part of the specular reflections; u_s and u_d denote the unit color vectors of the light source and skin tissue, respectively; u_p denotes the relative pulsatile strengths; I_0 denotes the stationary part of the illuminance intensity; and s_0 and d_0 denote the stationary specular and diffuse reflections, respectively.

The skin reflection model in Eq. 1 shows that the relation between the RGB value of the k-th skin pixel C_k(t) and the physiological signal \Theta(b(t), r(t)) is non-linear, and that the non-linearity arises from the non-stationary terms, such as illuminance variation, head motion, facial expression and camera intensity variation. A machine learning model is therefore desired to model the complex relationship between C_k(t) and \Theta(b(t), r(t)).

B. Dual Attention Network Architecture

1) Overview: Our proposed temporal shift dual attention multitask network architecture is shown in Fig. 2.
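Since the architecture relies on the temporal shift operation throughout, a minimal PyTorch sketch in the spirit of Lin et al. [16] may help. The function name, tensor layout and shift fraction (fold_div) are illustrative assumptions, not the exact implementation used in the paper.

```python
# Minimal sketch of a temporal shift (TS) module in the spirit of Lin et al. [16].
# The fold_div fraction and tensor layout are illustrative assumptions.
import torch

def temporal_shift(x: torch.Tensor, n_frames: int, fold_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels forward/backward along the temporal axis.

    x: feature map of shape (batch * n_frames, channels, height, width).
    """
    bt, c, h, w = x.shape
    b = bt // n_frames
    x = x.view(b, n_frames, c, h, w)
    fold = c // fold_div

    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift one fold of channels backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift another fold forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave remaining channels unshifted
    return out.view(bt, c, h, w)
```

Shifting only a small fraction of the channels keeps the cost of a 2D CNN while letting each frame's features mix with those of its temporal neighbors.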
We follow the network architecture proposed in [5], [8], which is a two-branch network with a motion branch and an appearance branch. The motion branch takes the difference between N consecutive frames' face ROIs as input; the appearance branch takes the current frame's face ROI as input. The temporal shift module [20] is applied before the 2D convolution layers in the motion branch. Two spatial attention layers are multiplied into the motion branch to select informative spatial features. Since physiological signals are not uniformly distributed over human skin, a soft spatial attention mask that gives higher weights to regions where the physiological signal is stronger can improve network accuracy. Different from [5], [8], the novel part of our architecture is the addition of channel-wise attention layers, which select discriminative features along the channel dimension. Our proposed dual attention network can be used to estimate the blood volume pulse (BVP) for heart rate estimation or the respiratory wave for respiratory rate estimation individually. The network can also estimate both BVP and respiratory wave in a multitask learning fashion.

2) Spatial Attention: A soft spatial attention mask is generated before the average pooling layers using a 1 × 1 convolution filter. The attention mask is then multiplied with the motion branch feature map via element-wise multiplication. The masked feature map Z_k passed to the next layer is calculated as

Z_k = \frac{H_k W_k \cdot \sigma(\omega_k X_a^k + b_k)}{2 \, \| \sigma(\omega_k X_a^k + b_k) \|_1} \odot X_m^k   (2)

where \sigma(\cdot) is the sigmoid activation function, \omega_k is the 1 × 1 convolution kernel, b_k is the bias, X_m^k is the motion branch feature map, X_a^k is the appearance branch feature map, \odot is element-wise multiplication, k is the layer index, and H_k and W_k are the height and width of the feature map.

3) Channel-wise Attention: We insert three channel-wise attention layers into the network: before each of the 2D convolutions that extract the attention masks in the appearance branch, and before the final average pooling. Inserting channel-wise attention in the appearance branch produces a better facial attention mask; inserting it before the final average pooling helps the network emphasize informative features and suppress less useful ones. Following Efficient Channel Attention (ECA) [27], the channel attention module performs channel-wise global average pooling, then a 1D convolution followed by a sigmoid function to learn the channel attention. Please refer to [27] for more details about the channel-wise attention module; a code sketch of both attention modules is given at the end of this section.

The multitask learning loss is the sum of the heart rate pulse waveform MSE loss and the respiratory waveform MSE loss:

L = \frac{\alpha}{T} \sum_{t=1}^{T} (p(t) - \hat{p}(t))^2 + \frac{\beta}{T} \sum_{t=1}^{T} (r(t) - \hat{r}(t))^2   (3)

where T is the time window; p(t) and r(t) are the time-varying ground-truth pulse and respiratory waveform sequences, respectively; \hat{p}(t) and \hat{r}(t) are the predicted pulse and respiratory waveforms; and \alpha, \beta are empirical parameters that balance the pulse waveform loss and the respiratory waveform loss. We set \alpha = \beta = 1 in our experiments.

The output of our neural network model is the pulse waveform sequence and the respiratory waveform sequence. To extract heart rate and respiratory rate in beats (breaths) per minute, a Butterworth bandpass filter is applied to the model outputs, with cut-off frequencies of 0.67 and 4 Hz for heart rate and 0.08 and 0.50 Hz for respiratory rate. The filtered signals are then divided into 10-second windows, and the Fourier transform is applied to take the dominant frequencies as the heart rate and respiratory rate (see the sketch below).
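The following is a hedged sketch of this post-processing: Butterworth bandpass filtering of the predicted waveform, then an FFT over 10-second windows to pick the dominant frequency. The function name, filter order and the 20 Hz sampling rate (matching COHFACE) are assumptions, not the paper's exact settings.

```python
# Sketch of the post-processing: bandpass filter, then per-window FFT peak picking.
import numpy as np
from scipy.signal import butter, filtfilt

def dominant_rate_bpm(waveform: np.ndarray, fs: float, low_hz: float, high_hz: float,
                      win_sec: float = 10.0, order: int = 2) -> np.ndarray:
    """Return one rate estimate (beats/breaths per minute) per 10-second window."""
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, waveform)  # zero-phase Butterworth bandpass

    win = int(win_sec * fs)
    rates = []
    for start in range(0, len(filtered) - win + 1, win):
        seg = filtered[start:start + win]
        freqs = np.fft.rfftfreq(win, d=1.0 / fs)
        power = np.abs(np.fft.rfft(seg)) ** 2
        band = (freqs >= low_hz) & (freqs <= high_hz)     # restrict to physiological band
        rates.append(60.0 * freqs[band][np.argmax(power[band])])  # dominant frequency -> bpm
    return np.asarray(rates)

# e.g. heart rate:       dominant_rate_bpm(pulse_pred, fs=20.0, low_hz=0.67, high_hz=4.0)
#      respiratory rate: dominant_rate_bpm(resp_pred,  fs=20.0, low_hz=0.08, high_hz=0.50)
```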
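To make the two attention mechanisms concrete, here is a minimal PyTorch sketch of the spatial attention mask of Eq. 2 and an ECA-style channel attention layer [27]. The class names, the fixed 1D kernel size (k_size = 3) and the module wiring are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpatialAttentionMask(nn.Module):
    """Soft spatial attention following Eq. 2: an l1-normalized sigmoid mask
    computed from appearance features and applied to motion features."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)  # omega_k and b_k

    def forward(self, x_appearance: torch.Tensor, x_motion: torch.Tensor) -> torch.Tensor:
        mask = torch.sigmoid(self.conv(x_appearance))         # sigma(omega_k X_a + b_k)
        _, _, h, w = mask.shape
        norm = 2.0 * mask.sum(dim=(2, 3), keepdim=True)       # 2 * l1 norm (entries are positive)
        mask = h * w * mask / norm                            # H_k W_k scaling from Eq. 2
        return x_motion * mask                                # element-wise product with X_m

class ECALayer(nn.Module):
    """ECA-style channel attention [27]: global average pooling, then a 1D
    convolution across channels (no dimensionality reduction), then sigmoid."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x.mean(dim=(2, 3))                      # channel-wise global average pooling
        y = self.conv(y.unsqueeze(1)).squeeze(1)    # local cross-channel interaction
        return x * torch.sigmoid(y).unsqueeze(-1).unsqueeze(-1)  # rescale channels
```

In the full network, the spatial masks would be computed from appearance-branch features and applied to the motion branch, with ECA layers placed as described above: twice in the appearance branch and once before the final average pooling.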
We compare our method with two approaches for heart rate and respiratory rate measurement: the 2D convolutional attention network (2D-CAN) [5] and the temporal shift convolutional attention network (TS-CAN) [8]. We run our experiments on the following datasets.

COHFACE [28]: The dataset contains RGB videos of faces, synchronized with the heart rate and breathing rate of the recorded subjects. The video sequences were recorded with a Logitech HD C525 at a resolution of 640x480 pixels and a frame rate of 20 Hz. Blood volume pulse (BVP) and respiratory waveforms are also recorded and synchronized with the video timestamps. The dataset includes 160 one-minute video sequences of 40 subjects (12 females and 28 males). There are 4 videos for each subject: 2 under good conditions and 2 under more natural conditions. The natural-condition videos have the ceiling lights off and half-opened blinds to introduce lighting variations. We follow the training and testing split protocol provided with the COHFACE dataset.

Our dataset: The dataset contains RGB videos of faces and ground-truth heart rate pulse waves recorded with a fingertip pulse oximeter. It includes one-minute video sequences of 15 subjects. Each subject gives one session under good conditions (similar distance to the camera, face static relative to the camera, good illumination) and another session under natural conditions (distances to the camera ranging from 0.5 m up to 2 m, large head motions in roll, pitch and yaw ranging from 0 to 90 degrees, and illuminance variations produced by an adjustable light source).

We use the COHFACE dataset [28] and our dataset to evaluate our proposed convolutional neural network architecture. We use the OpenCV face detector to obtain the face crop and resize it to 72x72 for motion map and appearance map generation. Previous work [5], [8] resized the face crop to 32x32; however, we found that higher resolution gives better accuracy. The motion map is the subtraction of the current frame's and the previous frame's face crops; the appearance map is the current frame's face crop. Both the motion map and the appearance map are normalized over the video (a sketch of this preprocessing is given at the end of this section). For a fair comparison with the previous work 2D-CAN [5] and TS-CAN [8], we use the same backbone architecture to train 2D-CAN, TS-CAN and our TS-DAN, the same preprocessed motion and appearance maps, and the same post-processing steps to train and evaluate the different networks.

We evaluate our model in several aspects. First, we run an ablation study on our proposed TS-DAN architecture, i.e., we compare accuracy when applying the ECA [27] module at different layers to show how the channel-wise attention module improves network accuracy; attention maps are visualized and compared with those of 2D-CAN and TS-CAN. Second, we compare our TS-DAN model with TS-CAN on single-task heart rate and respiratory rate learning to demonstrate the accuracy improvement from leveraging both spatial attention and channel-wise attention. Third, we evaluate our proposed TS-DAN in multitask learning for joint heart rate and respiratory rate estimation. Finally, we evaluate our model's cross-dataset generalization, i.e., we train our model and the previous models 2D-CAN and TS-CAN on one dataset and test accuracy on another dataset collected in different environment settings.
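As a hedged illustration of the preprocessing described above, the following sketch crops the face with an OpenCV detector, resizes to 72x72, and builds the motion map as a normalized consecutive-frame difference and the appearance map as the normalized current frame. The Haar cascade choice, the normalized-difference form and the standardization details are assumptions, not the paper's exact pipeline.

```python
# Sketch of motion/appearance map generation from a list of BGR video frames.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_crop(frame: np.ndarray, size: int = 72) -> np.ndarray:
    """Detect the face, crop it, resize to size x size, scale to [0, 1]."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, 1.1, 4)
    x, y, w, h = faces[0] if len(faces) else (0, 0, frame.shape[1], frame.shape[0])
    crop = cv2.resize(frame[y:y + h, x:x + w], (size, size))
    return crop.astype(np.float32) / 255.0

def motion_appearance_maps(frames):
    crops = np.stack([face_crop(f) for f in frames])                     # (T, 72, 72, 3)
    motion = (crops[1:] - crops[:-1]) / (crops[1:] + crops[:-1] + 1e-7)  # normalized frame difference
    motion = motion / (motion.std() + 1e-7)                              # standardize over the clip
    appearance = (crops[1:] - crops[1:].mean()) / (crops[1:].std() + 1e-7)
    return motion, appearance
```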
The evaluation metrics are computed over all windows of all test videos in a dataset. We use the following three metrics:

• Mean Absolute Error (MAE): The average absolute error between the ground-truth and predicted heart/respiratory rate, in beats per minute.

• Signal to Noise Ratio (SNR): We calculate the blood volume pulse and respiration SNR following De Haan et al. [13]. The SNR is calculated in the frequency domain as the ratio between the energy around the first two harmonics and the remaining frequencies within the heart rate or respiratory rate frequency range. SNR captures the quality of the predicted heart rate and respiratory rate; SNR < 0 indicates that the prediction is not reliable, since the signal energy is less than the noise energy (a code sketch of this computation appears at the end of this section).

• Availability: The percentage of evaluation windows with SNR ≥ 0. This metric captures the fraction of the time the system is able to predict a high-quality heart/respiratory rate.

The ablation study on our proposed dual attention network is shown in Table I. We implemented and experimented with multiple combinations of the ECA module, the TS module and the 2D-CAN model. Model 2D-CAN+1ECA, which adds one ECA layer before the last average pooling of 2D-CAN, already outperforms 2D-CAN even without the TS module, demonstrating that the ECA module helps select more informative features. TS-CAN+1ECA, which adds the TS module in the motion branch (the TS module location is shown in Fig. 2), further improves accuracy compared to 2D-CAN+1ECA. TS-CAN+3ECA, a combination of ECA, the TS module and the 2D-CAN model in which ECA is applied twice in the appearance branch before each spatial attention module and once more before the last average pooling, achieves the best accuracy. This result shows that ECA gives higher weights to the more informative channels in the appearance branch, helping the model distinguish the background from the human face.

Fig. 3 compares the first and second attention maps of 2D-CAN, TS-CAN and our proposed TS-DAN with 3 ECA layers. The first attention map is taken after the second convolution layer and the second after the fourth convolution layer, as shown in Fig. 2. In the first attention map, the 2D-CAN model shows high weights only on a small facial skin region; i.e., the subject's left cheek and left forehead receive low weights in the soft attention map. The TS-CAN model captures a larger skin region; however, the boundary is blurry and there are false-positive high weights in the eyelid region. The first attention map from our TS-DAN model clearly shows much better spatial localization of the skin regions where the physiological signal is stronger (forehead and cheeks). Furthermore, in the second attention map, TS-DAN shows larger contrast between the face region and the background with better boundary localization, indicating better spatial and channel-wise feature extraction; TS-DAN puts lower weight on the background than both 2D-CAN and TS-CAN. In summary, the attention maps generated by TS-DAN give higher weights to skin with better localization and lower weights to the background, which improves network robustness and reduces background noise.

Table II compares heart rate and respiratory rate estimation accuracy between TS-CAN and TS-DAN as single-task networks. TS-DAN gives a smaller MAE on the full COHFACE evaluation for both heart rate and respiratory rate estimation.
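For reference, here is a hedged sketch of the SNR metric in the spirit of De Haan et al. [13], used throughout these comparisons: the ratio of spectral energy in small bands around the first two harmonics of the reference rate to the energy elsewhere in the physiological band. The band half-width and function name are assumptions.

```python
# Sketch of the De Haan-style SNR metric [13] for a predicted waveform.
import numpy as np

def waveform_snr_db(pred: np.ndarray, fs: float, ref_hz: float,
                    band: tuple = (0.67, 4.0), half_width: float = 0.1) -> float:
    """SNR (dB): harmonic energy vs. remaining energy within the physiological band."""
    freqs = np.fft.rfftfreq(len(pred), d=1.0 / fs)
    power = np.abs(np.fft.rfft(pred)) ** 2
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    # Bands around the fundamental and its second harmonic count as signal.
    harmonic = (np.abs(freqs - ref_hz) <= half_width) | (np.abs(freqs - 2 * ref_hz) <= half_width)
    signal = power[in_band & harmonic].sum()
    noise = power[in_band & ~harmonic].sum()
    return 10.0 * np.log10(signal / (noise + 1e-12))
```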
Table III compares heart rate and respiratory rate estimation accuracy between MT-TS-CAN and MT-TS-DAN in multitask learning. MT-TS-DAN greatly decreases MAE and increases availability and SNR for heart rate estimation compared to MT-TS-CAN under the full, clean and natural data evaluations. MT-TS-DAN achieves higher availability in respiratory rate estimation; however, its respiratory rate MAE is slightly worse.

To test whether our model works well in different environments, such as different lighting conditions and different levels of head motion, we use our collected dataset to evaluate generalization accuracy in heart rate estimation. The model trained on clean-condition videos is tested on natural-condition videos, and vice versa. The generalization ablation study is shown in Table IV. Our proposed model achieves much lower MAE in both generalization tests.

We proposed the temporal shift dual attention network (TS-DAN) to estimate heart rate and respiratory rate from an RGB video. We integrated both spatial attention and channel-wise attention into a convolutional neural network architecture. The network can be applied in a single-task or multitask learning fashion. Experimental results demonstrate that the proposed TS-DAN system offers the best accuracy on two benchmark datasets.

References

[1] Nvidia Clara Guardian: Edge AI for Smart Hospitals.
[2] Eulerian video magnification for revealing subtle changes in the world.
[3] Detecting pulse from head motions in video.
[4] Camera-based system for contactless monitoring of respiration.
[5] DeepPhys: Video-based physiological measurement using convolutional attention networks.
[6] Video-based respiration monitoring with automatic region of interest detection.
[7] A novel algorithm for remote photoplethysmography: Spatial subspace rotation.
[8] Multi-task temporal shift attention networks for on-device contactless vitals measurement.
[9] Algorithmic principles of remote PPG.
[10] Remote PPG based vital sign measurement using adaptive facial regions.
[11] Robust heart rate from fitness videos.
[12] Measuring pulse rate with a webcam: a non-contact method for evaluating cardiac activity.
[13] Robust pulse rate from chrominance-based rPPG.
[14] Discriminative signatures for remote-PPG.
[15] Modified RGB cameras for infrared remote-PPG.
[16] Temporal shift module for efficient video understanding. 2019 IEEE.
[17] RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation.
[18] Effects of video encoding on camera-based heart rate estimation.
[19] Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement.
[20] TSM: Temporal shift module for efficient video understanding.
[21] Skeleton-based gesture recognition using several fully connected layers with path signature features and temporal transformer module.
[22] Action recognition using multi-scale temporal shift module and temporal feature difference extraction based on 2D CNN.
[23] Spatial transformer networks.
[24] Squeeze-and-excitation networks.
[25] Dual attention network for scene segmentation.
[26] Fully convolutional networks for semantic segmentation.
[27] ECA-Net: Efficient channel attention for deep convolutional neural networks. 2020 IEEE.
[28] A reproducible study on remote heart rate measurement.