Assessment of Deep Learning-based Heart Rate Estimation using Remote Photoplethysmography under Different Illuminations
Ze Yang, Haofei Wang, Feng Lu
2021-07-28

Remote photoplethysmography (rPPG) monitors heart rate without requiring physical contact, which allows for a wide variety of applications. Deep learning-based rPPG methods have demonstrated superior performance over traditional approaches in controlled settings. However, the lighting in indoor spaces is typically complex, with uneven light distribution and frequent variations in illumination, and a fair comparison of different methods under different illuminations on the same dataset has been lacking. In this paper, we present a public dataset, the BH-rPPG dataset, which contains data from thirty-five subjects under three illumination levels: low, medium, and high. We also provide the ground-truth heart rate measured by an oximeter. We compare the performance of three deep learning-based methods (Deepphys, rPPGNet, and Physnet) against four traditional methods (CHROM, GREEN, ICA, and POS) using two public datasets: the UBFC-rPPG dataset and the BH-rPPG dataset. The experimental results demonstrate that traditional methods are generally more resistant to fluctuating illumination. We found that Physnet achieves the lowest mean absolute error (MAE) among the deep learning-based methods under medium illumination, whereas CHROM achieves 1.04 beats per minute (BPM), outperforming Physnet by 80%. Additionally, we investigate potential ways of improving the performance of deep learning-based methods and find that brightness augmentation makes the models more robust to illumination variation. These findings suggest that illumination variation should be taken into account when developing deep learning-based heart rate estimation algorithms. This work serves as a benchmark for rPPG performance evaluation and opens a pathway for future investigation into deep learning-based rPPG under illumination variations.

Heart rate (HR) is an important physiological indicator of both physical and mental health. HR monitoring has been used in many applications, such as state monitoring [1], driver fatigue detection [2], face anti-spoofing [3], etc. Traditional HR monitoring methods rely on electrocardiography (ECG) and contact photoplethysmography (PPG) sensors. However, wearing such contact devices is uncomfortable and often interferes with daily activities. With the development of computer vision algorithms, remote HR measurement based on remote photoplethysmography (rPPG) has been proposed [4]-[7]. While rPPG offers the potential for contactless and continuous measurement of HR using low-cost web cameras, system performance is still limited by many factors, such as lighting variations and head movements [8]. Lighting conditions are critical for rPPG, since the quality of the rPPG signal is determined by the light that penetrates the skin. However, most existing studies have only explored laboratory settings with good lighting [4], [5], [9]-[12]. Insufficient illumination may lead to a low-amplitude rPPG signal, because the light energy is too weak to penetrate the skin surface.
Moreover, most traditional methods need to locate specific skin areas [4], [12], and the low contrast of a dimly lit image makes it difficult to obtain the correct region of interest (ROI). Conversely, high light intensity leads to image clipping on the skin surface [13], [14]. The light distribution also has a significant impact on rPPG. Conventional methods usually select the whole face as the ROI and assume that different parts of the face contribute equally to the rPPG signal. This assumption may not hold in real-world applications; in indoor spaces especially, the light distribution and intensity vary with the relative position between the subject and the light source. Traditional approaches extract the rPPG signal in different ways and can be mainly categorized into two types: 1) skin reflection model-based approaches [4], [12] and 2) blind source separation-based approaches [11], [15]. Unfortunately, these models seldom take lighting conditions into consideration. Po et al. proposed an adaptive ROI approach, based on the quality of the rPPG signal acquired from sub-regions of the face, to tackle the challenge of uneven light distribution. However, system performance under different lighting intensities has not yet been evaluated.

Deep learning-based approaches, such as convolutional neural networks (CNNs), have been used to estimate HR from the rPPG signal. Špetlik et al. [7] proposed using a 2D-CNN backbone to directly estimate a single HR value at an early stage of rPPG research, but this neglects the temporal information between frames. Physnet [16] and rPPGNet [9] can detect atrial fibrillation (AF) by generating more precise rPPG signals. Physnet employs a deep spatio-temporal network with a 3D-CNN backbone and builds an end-to-end model. rPPGNet treats HR estimation as a multi-task learning problem (HR estimation and skin segmentation), also using a 3D-CNN backbone with a skin segmentation branch. Although these works achieve superior performance, the quality of the rPPG signals they generate under different illuminations is still unknown. Deepphys [5] combines the theory of the skin reflection model with an attention mechanism and adopts a 2D-CNN backbone. It outperforms the traditional methods by using an attention mechanism that takes into account the rPPG intensity distribution across different parts of the face. It is well known that the performance of deep learning models is sensitive to illumination. While the filters in a CNN learn patterns that capture different levels of visual information in most computer vision tasks, in the HR estimation task the lighting affects the quality of the rPPG signal itself. Wang et al. [17] conducted a series of experiments showing that CNNs use the color variation caused by blood absorption to estimate HR. However, they did not validate the performance of deep learning models under different illuminations. For data-driven approaches, the quality of the training data determines system performance. Most previous studies evaluated performance on different datasets, which makes comparisons between systems unfair. For example, Physnet [16] is trained on the OBF [18] dataset and Deepphys on the RGB Video I [5] dataset; neither dataset is publicly accessible.
To evaluate the robustness of different methods in real-world applications, we present a public dataset, the BH-rPPG dataset (BH stands for BeiHang University), which covers three lighting intensities with uneven light distribution on the face (see the first row in Fig. 1). In summary, the primary contributions of this paper are three-fold: 1) We present BH-rPPG, a public dataset for rPPG-based heart rate estimation. BH-rPPG consists of thirty-five subjects' data under three different illuminations. The link can be found at https://github.com/yangze68/BH-rPPG. 2) We systematically evaluate the robustness to illumination variation of typical methods for rPPG-based heart rate estimation, including four traditional methods [4], [10]-[12] and three deep learning-based methods [5], [9], [16]. 3) Our experimental results suggest that although the deep learning-based methods achieve superior performance under normal illumination, they are less resistant to illumination variations than traditional methods. We also explore potential ways of improving the performance of deep learning-based models, and the results show that brightness augmentation is effective across different lighting conditions. These findings draw attention to the need for more robust deep learning-based methods for remote heart rate estimation.

The remainder of this paper is organized as follows. Section II summarizes related work on rPPG-based heart rate estimation and on the lighting conditions in different applications. Section III describes the datasets, including UBFC-rPPG and the proposed BH-rPPG. Section IV describes the experimental setup, including methods, experimental protocols, performance evaluation metrics, and potential methods for performance improvement. Section V presents the experimental results. Section VI discusses the findings. Finally, Section VII concludes the paper and outlines future work.

We first review existing methods for heart rate estimation using rPPG, covering both traditional and deep learning-based approaches. Then we summarize the different lighting conditions in various rPPG applications. Heart rate can be remotely monitored through two channels: ballistocardiography (BCG) [19], [20] and remote photoplethysmography (rPPG) [4], [5], [9]-[12]. BCG-based methods use a camera to capture the subtle movements induced by the periodic ejection of blood into the vessels with each heartbeat; non-contact BCG pulse measurement is achieved by blind source separation of the head movements in video. However, BCG-based methods are usually limited by the user's head movement, since the faint movement traces induced by cardiac activity are hard to capture during large-scale head movements. In contrast, rPPG-based methods register the pulse through subtle color variations of human skin [10], [21]. This measurement is based on the fact that the pulsatile blood propagating through the human cardiovascular system changes the blood volume in skin tissue. The circulation of oxygenated blood leads to fluctuations in the amount of hemoglobin molecules and proteins, thereby causing variations in optical absorption and scattering across the light spectrum [21].
The rPPG-based methods can be categorized into two types: 1) traditional methods that rely on optical models, e.g., the Lambert-Beer law and Shafer's dichromatic reflection model; and 2) deep learning-based methods that rely on the appearance of the face. The optical models used in the traditional methods are grounded in the optical properties of the skin under ambient illumination. Different color channels contain rPPG signals of different quality. The green channel was used in early rPPG research, since it yields the strongest rPPG signal [10]. Previous studies have shown that cardiac activity causes variation in optical absorption across the light spectrum [22]; exploiting this characteristic, CHROM [4] and POS [12] project the RGB channels onto different planes by re-weighting and linearly combining the color channels. Blind source separation has also been proposed, which assumes that the temporal PPG trace can be retrieved from independent or uncorrelated signal sources. Independent component analysis has been applied to multiple signal sources obtained in different ways, such as the color channels of the same region [11] and patch-level regions of interest (ROI) [23].

Many deep learning-based heart rate estimation methods have been proposed recently. Chen and McDuff [5] presented Deepphys, which employs two parallel CNN branches to extract rPPG features: a motion branch and an appearance branch. The motion branch is fed with normalized frame differences to cancel motion effects on the rPPG signal, while the appearance branch uses an attention mechanism that enables the network to focus on skin areas. Other researchers have investigated different network architectures for better estimation. Yu et al. [16] developed an end-to-end network to estimate heart rate from compressed videos; they used a three-dimensional CNN to capture temporal information and an extra skin segmentation branch to regress the PPG signal. Niu et al. [24] proposed to estimate heart rate directly from a spatio-temporal network. Špetlik et al. [7] introduced a two-step network for feature extraction and heart rate estimation. Qiu et al. [25] integrated the signal magnification technique Eulerian video magnification [26] with a convolutional neural network to estimate heart rate. Lee et al. [27] proposed a transductive meta-learner to adapt the model to different domains. In addition to meta-learning, Niu et al. [28] introduced a cross-verified scheme to purify features constructed from spatio-temporal maps. Although deep learning methods yield promising results, their performance under different illuminations remains to be explored.
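To make the motion-branch input concrete, the snippet below sketches a normalized frame-difference computation in the spirit of Deepphys [5]. It is a minimal sketch only: the function name, the epsilon term, and the final standardization are our own illustrative choices, not the authors' released code.

```python
import numpy as np

def normalized_frame_difference(frame_t, frame_t1, eps=1e-7):
    """Motion representation in the spirit of Deepphys:
    d = (c(t+1) - c(t)) / (c(t+1) + c(t)), computed per pixel and
    channel. A shared multiplicative illumination change cancels in
    the ratio, making the input robust to slow lighting drift."""
    f0 = frame_t.astype(np.float32)
    f1 = frame_t1.astype(np.float32)
    diff = (f1 - f0) / (f1 + f0 + eps)
    # Standardize so rare large differences (e.g., blinks) do not
    # dominate training.
    return diff / (diff.std() + eps)
```

Because the appearance branch sees the raw frame while the motion branch sees only this ratio, the attention mask learned from appearance can suppress non-skin pixels before pooling.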
rPPG has been used in many applications, such as state monitoring at home and driver fatigue detection in the car, where the lighting conditions can be very different. In an indoor environment, depending on the relative position between the person and the light source, the system may suffer from insufficient and uneven lighting. In state monitoring applications, the algorithm should adapt to light variation. For example, Sun et al. [29] continuously monitored the discomfort of infants over a long period, which requires the algorithm to work under complex lighting. To estimate heart rate in extremely low light, Lin et al. [30] proposed extracting features from the infrared spectrum. In addition, due to COVID-19, the demand for non-contact healthcare techniques is increasing dramatically [31]. However, investigations of algorithm performance under complex lighting conditions are relatively rare. In outdoor environments, heart rate estimation becomes more challenging since the illumination changes dramatically [32]. In the driver fatigue detection task, the illumination is quite distinct from that of the laboratory. In other applications, such as face anti-spoofing and online payment systems, rPPG technology can be used to perform liveness detection, which prevents a fake face from being used to circumvent the system and gain unauthorized access [3]. Deepfake videos can also be distinguished by rPPG [33], where the lighting conditions are far more complicated due to the wide range of usage. In summary, although rPPG has been deployed in many applications, non-ideal lighting conditions degrade system performance. Thus, it is necessary to conduct a systematic comparison between different approaches and to evaluate their robustness under different lighting conditions.

In this section, we first briefly introduce the public dataset used in the experiment. Then, we present the details of the BH-rPPG dataset under three different lighting conditions: low, medium, and high illumination. Most public datasets are collected under controlled environments, such as UBFC-rPPG [34], VIPL [35], PURE [36], and MAHNOB [37]. To the best of our knowledge, no public dataset examines the effect of illumination intensity. Although the COHFACE [38] dataset was collected under both controlled lighting and natural lighting, the lighting intensity remains the same. Here we choose the UBFC-rPPG [34] dataset as the training set for the deep learning methods. The University Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy dataset (the UBFC-rPPG dataset) consists of two scenarios; we only use the part in which subjects play a time-sensitive mathematical game. This is a realistic setting that includes natural head movements, and the subjects' heart rates change over time as induced by the mathematical games. The dataset includes 42 one-minute videos from different subjects. Each video was recorded using a low-cost webcam (Logitech C920 HD Pro) at 30 fps with a resolution of 640×480 in uncompressed 8-bit RGB format. A CMS50E transmissive pulse oximeter was used to obtain the ground-truth PPG data, comprising the PPG waveform as well as the PPG heart rates.

B. BH-rPPG dataset. 1) Apparatus setup: Fig. 2 presents the experimental setup. Two light sources (a ceiling lamp and a table lamp) create the different lighting conditions. An oximeter (CONTEC CMS50E) was used to obtain the ground-truth PPG data. A webcam (Logitech HD Pro Webcam C310 color camera) recorded the video data synchronized with the oximeter. The video resolution is 640×480. The webcam's nominal frame rate is 30 fps, but under low lighting intensity the actual frame rate drops to about 20 fps. The subject sits 1 meter away from the camera. Since our study focuses on illumination variations rather than head movements, subjects were asked to keep their heads stationary during data collection. We used two lamps in the experiment because this is closer to the settings of daily living. We collected data under three lighting conditions, as shown in Table I. With the ceiling lamp always on in all three conditions, we change the mode of the table lamp to modulate the illumination. Fig. 4 shows sample images under the different illuminations.
2) Data collection procedure: We recruited 35 healthy subjects (16 males and 19 females, a gender ratio of 0.84) on campus, with a mean age of 24 (SD 2.31). For each subject, we recorded three 30-second videos, one under each lighting condition. The left part of Fig. 4 shows the average lighting intensity under the three conditions; the illuminations of the low, medium, and high levels are 8.0, 42.4, and 104.0 lux, respectively. 3) Dataset statistics: The BH-rPPG dataset consists of 105 videos from 35 participants (see Table II). To quantitatively demonstrate the lighting variations in the BH-rPPG dataset, we also computed bar charts of the mean pixel value of the videos under each lighting condition. Since we only care about the lighting on the face area, we compute the mean pixel value within the bounding box of the face (as shown in Fig. 3).

In this section, we introduce the methods compared in this paper. Next, we describe our experimental protocol and the performance evaluation metrics. Finally, to gain further understanding of the effect of illumination on HR estimation accuracy, we investigate several potential remedies, such as illumination compensation techniques and training strategies, to improve the performance of current deep learning-based models. We evaluate both traditional methods and deep learning-based methods. To make a fair comparison and eliminate differences during preprocessing, we used the Viola-Jones face detector to extract the face area, reducing noise from the background. We employed the Kanade-Lucas-Tomasi (KLT) [39] algorithm to track the location of the face region to compensate for rigid head movements. The processed video frames are used as input for the different algorithms.

1) Traditional methods: We compared four representative methods: GREEN [10], CHROM [4], POS [12], and ICA [11], implemented with the open-source toolbox iPhys [6]. The basic workflow of the traditional methods is shown in Fig. 5. First, we detect and track the bounding box of the face using the KLT algorithm [39]. Then, the skin area is detected, and the eyes and mouth are removed, since they often introduce noise through non-rigid movements during blinking and speech. Next, the pulse signal is extracted by spatial pooling: the mean skin-pixel values of each frame are concatenated into a raw pulse trace $s = [s_1, s_2, \ldots, s_L]$, where $L$ is the number of frames in the video. After that, the varying component induced by the heartbeat is obtained by band-pass filtering and detrending. Finally, we apply the different methods to the raw pulse trace, transform the signal into the frequency domain using the Fast Fourier Transform (FFT), and locate the spectral peak to estimate the heart rate.
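To make this workflow concrete, here is a minimal GREEN-style estimator using NumPy and SciPy. It assumes the face ROI has already been detected, tracked, and cropped for every frame, and it omits the skin/eye/mouth masking step, so it is a simplified sketch rather than the iPhys implementation.

```python
import numpy as np
from scipy.signal import butter, detrend, filtfilt

def estimate_hr_green(roi_frames, fps=30.0, band=(0.7, 2.5)):
    """roi_frames: (L, H, W, 3) uint8 face crops in RGB order.
    Returns the estimated heart rate in beats per minute (BPM)."""
    # 1) Spatial pooling: mean green value per frame -> raw pulse
    #    trace s = [s_1, ..., s_L].
    trace = roi_frames[..., 1].reshape(len(roi_frames), -1).mean(axis=1)
    # 2) Detrend to remove slow drift (illumination, residual motion).
    trace = detrend(trace)
    # 3) Band-pass to the plausible HR range (0.7-2.5 Hz = 42-150 BPM).
    nyq = fps / 2.0
    b, a = butter(3, [band[0] / nyq, band[1] / nyq], btype="band")
    trace = filtfilt(b, a, trace)
    # 4) FFT and peak picking inside the pass-band.
    freqs = np.fft.rfftfreq(len(trace), d=1.0 / fps)
    power = np.abs(np.fft.rfft(trace)) ** 2
    mask = (freqs >= band[0]) & (freqs <= band[1])
    return 60.0 * freqs[mask][np.argmax(power[mask])]
```

CHROM, POS, and ICA differ mainly in step 1: instead of pooling a single channel, they pool all three color channels and then re-weight or unmix them before the filtering and FFT stages.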
2) Deep learning-based methods: We evaluate the performance of three typical deep learning-based methods: Deepphys [5], rPPGNet [9], and Physnet [16]. Deepphys is a two-dimensional CNN-based network that uses an attention mechanism to learn a skin map. rPPGNet is a three-dimensional CNN-based network that uses soft attention to make the model focus on the skin area. Physnet uses a temporal encoder-decoder structure for the rPPG task, an architecture originally applied to action recognition. The basic procedure of the deep learning-based methods can be formulated as

$$[\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T] = \Omega_w(\Phi_\theta([x_1, x_2, \ldots, x_T])),$$

where $[x_1, x_2, \ldots, x_T]$ are frames sampled from the original video and $[y_1, y_2, \ldots, y_T]$ are the ground truths collected by a finger oximeter. First, a CNN backbone $\Phi$ with parameters $\theta$ extracts spatio-temporal features; then $\Omega$ with parameters $w$ performs channel aggregation. The estimated PPG signal is $[\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T]$. In Deepphys, $T$ is set to 2, meaning that two consecutive frames are used to compute a normalized difference frame that yields a single output $\hat{y}$. Deepphys uses a 2D CNN as $\Phi$ to extract spatial information and uses soft attention to assign different weights to skin regions. Physnet and rPPGNet use a 3D CNN as $\Phi$ to model the temporal signal and take into account the correlation between ground truth and output. We re-implemented the Deepphys algorithm, since the authors did not release the source code; for rPPGNet and Physnet, we directly used the open-source models.
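The schematic PyTorch model below illustrates this $\Phi$/$\Omega$ decomposition with a deliberately tiny 3D-CNN. The layer sizes are illustrative only and far shallower than the published Physnet and rPPGNet architectures; the sketch shows the shape contract, not the real networks.

```python
import torch
import torch.nn as nn

class TinyRPPGNet(nn.Module):
    """Schematic rPPG model: y_hat = Omega_w(Phi_theta(x)).
    Input x: (B, 3, T, H, W) video clip; output: (B, T) PPG estimate."""

    def __init__(self):
        super().__init__()
        # Phi: spatio-temporal backbone (kept deliberately small).
        self.phi = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.BatchNorm3d(16), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # pool space, keep time
        )
        # Omega: 1x1x1 convolution aggregating channels per frame.
        self.omega = nn.Conv3d(32, 1, kernel_size=1)

    def forward(self, x):
        feat = self.phi(x)      # (B, 32, T, 1, 1)
        ppg = self.omega(feat)  # (B, 1, T, 1, 1)
        return ppg.flatten(1)   # (B, T)

model = TinyRPPGNet()
clip = torch.randn(2, 3, 64, 64, 64)  # a 64-frame clip of 64x64 crops
print(model(clip).shape)  # torch.Size([2, 64])
```

Training then minimizes a distance (or, as Physnet and rPPGNet do, maximizes the correlation) between the predicted sequence and the oximeter ground truth.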
On the one hand, we would like to compare the performance of different deep learning methods trained under the same protocol, i.e., trained and evaluated on the same dataset. On the other hand, our goal is to evaluate the performance of deep learning-based methods under different lighting conditions. We therefore provide a comprehensive performance comparison between the different methods, and we also explore potential approaches for improving the performance of the deep learning-based methods.

1) Performance comparison under the same training protocol: We use the UBFC-rPPG dataset to train and test the deep learning-based methods. Specifically, we randomly divide the 42 videos into a training set (37 videos) and a test set (5 videos). Since each video corresponds to one subject, the task is subject-independent. For the traditional methods, we only evaluated performance on the test set, for a fair comparison with the deep learning-based methods. For Deepphys [5], we reproduced the model and trained it with the same learning rate and batch size. For Physnet [16] and rPPGNet [9], we adopted clip lengths of 128 and 64 frames, respectively, sampled from the original videos. 2) Performance comparison under different illuminations: We evaluated the traditional methods and the deep learning-based methods trained on the UBFC-rPPG dataset. For the traditional methods, we followed the settings of iPhys [6], except that the skin-pixel value range was changed according to the lighting condition; the raw signal was filtered to the frequency range of 0.7 to 2.5 Hz. For the deep learning-based methods, we used the best model trained under the protocol above to cross-test on the BH-rPPG dataset. We used the average HR as the evaluation protocol for the traditional methods.

Additionally, we investigate various strategies for enhancing the performance of deep learning-based methods. For a model to better predict HR in videos recorded under low lighting, a natural approach is to enhance each dark video frame with illumination compensation techniques to make the frame visually clearer. Another direction for improving model accuracy is data augmentation, which is ubiquitous in image classification when labeled data is scarce; however, augmentation of lighting variations remains under-explored. We therefore tackle this problem in two stages: (1) applying image enhancement after training and (2) generating more lighting variation in the data during training. We describe the implementation details below. 1) Image enhancement methods: We use three typical image enhancement algorithms: a) Histogram Equalization (HE), b) Gamma Correction (GC), and c) Zero-Reference Deep Curve Estimation (ZERO-DCE [40]). HE is a common approach to increase frame contrast, since complex illumination leads to an imbalanced lighting distribution on the face. We apply HE to the V channel in the HSV color space, which is most related to brightness, and then transform back to the RGB color space for model inference. GC is a traditional image enhancement technique to improve image quality. For videos recorded in low-light conditions, we set the gamma value to 2.5; for videos with high lighting intensity, we set the gamma value to 0.8. ZERO-DCE [40] is a more recent low-light image enhancement method that estimates pixel-wise, high-order curves for dynamic range adjustment of a given image. We directly use the pre-trained model to enhance our videos recorded in low-light conditions. 2) Video data augmentation: We apply the same brightness augmentation to an entire clip rather than to each frame individually, since a random transformation applied per frame would break the original pixel distribution over time. For brightness augmentation, we modified the function provided in PyTorch [41] and randomly varied the brightness parameter in the range [0.5, 1.5].
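Below is a minimal sketch of this clip-consistent brightness jitter built on torchvision's functional API; the helper name is ours, but the sampling range follows the [0.5, 1.5] setting above.

```python
import random
import torch
import torchvision.transforms.functional as TF

def brightness_jitter_clip(clip, low=0.5, high=1.5):
    """clip: (T, C, H, W) float tensor with values in [0, 1].
    Samples ONE brightness factor and applies it to every frame, so
    the frame-to-frame color variation carrying the pulse is scaled
    uniformly instead of being corrupted by per-frame randomness."""
    factor = random.uniform(low, high)
    return torch.stack([TF.adjust_brightness(f, factor) for f in clip])
```

During training, the jittered clip is paired with the unchanged ground-truth PPG, which pushes the model toward features that are invariant to absolute lighting intensity.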
We used four evaluation metrics: the mean absolute error (MAE), the root mean square error (RMSE), the signal-to-noise ratio (SNR), and the Bland-Altman plot [8], [42]. 1) Mean absolute error:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\mathrm{HR}^{(i)}_{predict} - \mathrm{HR}^{(i)}_{gt}\right|,$$

where $n$ is the total number of samples, $\mathrm{HR}^{(i)}_{predict}$ is the HR estimated by rPPG for the $i$-th sample, and $\mathrm{HR}^{(i)}_{gt}$ is the ground-truth HR for the $i$-th sample. 2) Root mean square error:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\mathrm{HR}^{(i)}_{predict} - \mathrm{HR}^{(i)}_{gt}\right)^{2}}.$$

3) Signal-to-noise ratio: The SNR is the ratio of the energy around the fundamental frequency plus the first harmonic of the pulse signal to the remaining energy in the spectrum. We follow the definition in [4]:

$$\mathrm{SNR} = 10\log_{10}\left(\frac{\sum_{f}\left(U_t(f)\,\hat{S}(f)\right)^{2}}{\sum_{f}\left(\left(1 - U_t(f)\right)\hat{S}(f)\right)^{2}}\right),$$

where $\hat{S}(f)$ is the spectrum of the pulse signal $S$, $f$ is the frequency in beats per minute, and $U_t(f)$ is a binary template window that passes the fundamental and first-harmonic bands. 4) Bland-Altman plot: This plot demonstrates the consistency between two signals. The differences between the heart rate estimated by the rPPG algorithm and the ground truth are plotted against their average. We show the mean, the standard deviation (SD), and the 95% limits of agreement (±1.96 SD) of the differences.

Table III shows the performance of the traditional methods and the deep learning-based methods; note that the deep learning-based methods are trained and tested under the same protocol on the UBFC-rPPG dataset. Physnet [16] achieves the best results within UBFC-rPPG, and the deep learning-based methods compare favorably with the traditional methods. These results suggest that deep learning-based methods indeed deliver superior performance within the training domain. From Fig. 6, we observed that CHROM under the medium lighting condition achieved the best result among all methods, at 1.04 BPM, with the medium- and high-light conditions showing similarly good results. The second-best result is achieved by POS under high light; apart from its medium-light result of 9.96 BPM MAE, the performance of POS stays under 2 BPM. Among the deep learning-based methods, Physnet reached an MAE of 5.37 BPM in medium light, the best performance among the deep learning approaches. However, it still falls behind most of the traditional methods, except ICA in low light and POS in medium light, which perform poorly in those respective conditions. These results reveal that the traditional methods are more robust across illumination scenarios. In the RMSE plots, the lowest error is achieved by CHROM, which outperforms the best deep learning-based model by 80%. From the SNR plot in Fig. 6, we found that the conventional methods achieve positive values in the high and medium lighting conditions, whereas among the deep learning-based methods only Deepphys achieves positive SNR in the medium- and high-light conditions. It is notable that the deep learning-based methods are significantly inferior to the traditional methods in terms of generalization ability, and that the different algorithms perform inconsistently under different illuminations: Deepphys works better in high light, while both rPPGNet and Physnet are more effective in medium light, possibly because the medium-light condition is most similar to UBFC-rPPG. Fig. 7 depicts the ground-truth HR and the estimated HR under the three lighting conditions to illustrate the estimation consistency of the various methods; each scatter point represents the HR estimation error. The traditional methods are much more consistent with the ground-truth HR than the deep learning-based methods, i.e., their scatter points lie very close to the x-axis. Both rPPGNet and Deepphys have errors above 30 BPM, indicating failed HR predictions. For Physnet, the errors in low lighting conditions stabilize at a low level within the range [-10, 10], with only a few outliers around 20 BPM; the traditional methods nevertheless give more consistent results.

C. Results of applying image enhancement methods to BH-rPPG. Table V shows that the HE method has a negative effect under all three lighting conditions, except for a gain in SNR only for rPPGNet and Physnet. Similar to HE, ZERO-DCE contributes little to performance. One possible reason is that temporal consistency is essential for HR estimation, and illumination compensation techniques break the coherence of pixel variation between frames. With GC-H, Physnet achieves improvements of 9.83%, 6.87%, and 4.81% in MAE, RMSE, and SNR, respectively, possibly because GC successfully aligns the brightness of the BH-rPPG high-light videos with the UBFC-rPPG dataset. In Table VI, brightness jitter during training has a strong positive effect on rPPGNet and Physnet. Compared with the non-augmented counterparts, Physnet with data augmentation achieves the best performance in the high-light condition, with an MAE gain of 49.19%, while rPPGNet with data augmentation obtains an MAE improvement of 10.87% and an SNR improvement of 30.05% in the medium-light condition. We notice that Deepphys shows little improvement from data augmentation, partly because a 2D-CNN struggles to learn the illumination variation. The results in Table VI therefore show that brightness augmentation during training is more efficient than image enhancement at improving performance.

To better understand the performance differences, we visualize the region of interest (ROI) used by the different methods. Fig. 8 shows the original frame, the ground truth, the preprocessing results of the traditional methods, and the attention weights of intermediate steps in rPPGNet. The POS method is selected to produce the ground-truth map, using the definition of SNR given in Section IV, since the traditional method can accurately depict the actual distribution of the rPPG signal on the face. We can see from the original frame and the ground truth that the brighter a facial region, the higher the SNR of its rPPG signal. The varying light intensity changes the distribution of the rPPG signal, which echoes the findings in [12] and [13].
However, the attention weights learned by the deep learning methods show that the neural network focuses on the background and on skin areas irrelevant to the light. One possible reason is the domain gap between the training data (UBFC-rPPG) and the test data (BH-rPPG): the skin tones and good lighting conditions in the UBFC-rPPG training set differ from those in the BH-rPPG test set. Lee et al. [43] proposed a meta-learning framework to update model weights, which may help a model adapt to different application situations. In addition, the skin branch in rPPGNet is a set of learnable weights optimized against a binary skin-mask ground truth generated by [44]; when the lighting intensity, skin tone, or environment changes, the skin branch naturally produces incorrect skin masks. We believe that the ROI-finding branch is significant for deep learning-based rPPG, and that varying lighting conditions have a significant effect on the performance of deep learning-based methods. In contrast to deep learning-based rPPG, the traditional methods detect skin regions in a preprocessing step, as visualized in the traditional columns of Fig. 8. Although the lighting intensity distribution on the face is uneven, the skin detection algorithm successfully locates the correct skin area containing the rPPG signal induced by cardiac activity. This may explain why the traditional approaches outperform the deep learning-based methods under different illuminations. Furthermore, average pooling over the whole region retrieves information from the brighter regions, which contain more rPPG signal, and smooths the darker regions, which contribute less. Additionally, due to the spatial redundancy of rPPG [45], some studies partition the face into distinct grids, and handling each grid separately may be beneficial under varying lighting conditions.

In this paper, we investigated the performance of deep learning-based rPPG under varying lighting conditions, using conventional methods as a baseline. From Table III and Table IV, we found that the deep learning-based methods perform well within the UBFC-rPPG dataset but poorly on the BH-rPPG dataset. This holds especially for Physnet: although it shows the best performance within UBFC-rPPG and the best results among the deep learning-based methods, it is still not as accurate as traditional methods such as CHROM. The appearance of the background, the subject's skin tone, and the lighting conditions greatly impact the deep learning-based methods, so the ambient light needs to be chosen carefully. The conventional methods, by contrast, perform poorly on the UBFC-rPPG dataset but well on the BH-rPPG dataset. One possible reason for the low accuracy on UBFC-rPPG is the effect of motion on the rPPG signal: the deep learning-based methods learn the relationship between pixel values and HR through a large number of non-linear mappings, whereas the linear combination of color channels in the conventional methods may not hold in complicated environments with large head movements. Given their robustness under varying lighting conditions, however, conventional methods are more suitable for scenarios with little head movement, such as liveness detection in payment systems. We next examined how brightness augmentation and image enhancement techniques improve the models' performance under various lighting conditions.
According to the results presented above, image enhancement techniques cannot directly improve the HR estimation accuracy of deep learning-based methods. Although these techniques effectively improve image contrast and brightness, the accuracy gains achieved by the tested models are quite limited, and some image enhancement methods even degrade estimation accuracy by disrupting the temporal consistency of frames and the pixel value distribution. Further research on image enhancement techniques for rPPG tasks is needed. The results in Table VI show that integrating brightness augmentation leads to a promising performance improvement on BH-rPPG; it is a powerful tool for forcing a deep learning-based model to learn features invariant to lighting intensity. In general, HR estimation using rPPG is similar to other video-based learning tasks, such as action recognition, where the context of human anatomy is modeled between frames; the distinction is that rPPG models the color variation associated with HR. Poor weather conditions introduce photometric differences that degrade the performance of action recognition systems, much as different light intensities degrade deep learning-based rPPG. Data augmentation is a straightforward and effective remedy for this deficiency in the field of action recognition. Inspired by this, we applied the data augmentation technique to deep learning-based HR estimation, and the experimental results demonstrate its effectiveness for rPPG tasks under different illumination conditions. Compared to the image enhancement methods, brightness augmentation improves the robustness of deep learning models under different lighting conditions. We argue that this is because a model without augmentation can easily overfit to a specific range of pixel values. Image enhancement disrupts the original color distribution along both the spatial and temporal dimensions, which is critical for rPPG signals, whereas brightness augmentation increases data diversity and encourages the model to focus on HR-related features that are invariant under different lighting conditions. However, most existing work on data augmentation focuses on single images rather than videos; designing more useful video augmentation techniques could bridge the illumination robustness gap between deep learning-based and traditional methods.

In this work, we compared the performance of different methods for rPPG-based heart rate estimation under three lighting intensities. The results show that the conventional methods are more robust to changes in lighting intensity and uneven lighting distribution, while Physnet achieves the best performance among the deep learning-based methods. In the development of deep learning-based methods, varying lighting conditions, especially different lighting intensities and uneven lighting distribution, should be taken into account. Moreover, we conducted a comparative evaluation of deep learning-based techniques under the same training paradigm; the results show that Physnet achieves the best performance within the UBFC-rPPG dataset and that deep learning-based methods are able to capture the temporal variation of skin color under motion. Furthermore, we explored potential methods for improving performance under different illuminations; the results show that brightness augmentation is effective in improving performance across lighting conditions.
The findings of this study urge additional research into developing more robust deep learning models that enable practical applications in daily living.

References
[1] Remote heart rate variability for emotional state monitoring.
[2] Vision-based instant measurement system for driver fatigue monitoring.
[3] Face liveness detection by rPPG features and contextual patch-based CNN, Proceedings of the 2019 3rd International Conference on Biometric Engineering and Applications.
[4] Robust pulse rate from chrominance-based rPPG.
[5] DeepPhys: Video-based physiological measurement using convolutional attention networks.
[6] iPhys: An open non-contact imaging-based physiological measurement toolbox.
[7] Visual heart rate estimation with convolutional neural network.
[8] Remote heart rate measurement from face videos under realistic situations.
[9] Remote heart rate measurement from highly compressed facial videos: An end-to-end deep learning solution with video enhancement.
[10] Remote plethysmographic imaging using ambient light.
[11] Advancements in noncontact, multiparameter physiological measurements using a webcam.
[12] Algorithmic principles of remote PPG.
[13] Robust and automatic remote photoplethysmography.
[14] Adaptive gain tuning for robust remote pulse rate monitoring under changing light conditions.
[15] Measuring pulse rate with a webcam - a non-contact method for evaluating cardiac activity.
[16] Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks.
[17] Analysis of CNN-based remote-PPG to understand limitations and sensitivities.
[18] The OBF database: A large face video database for remote physiological signal measurement and atrial fibrillation detection.
[19] Detecting pulse from head motions in video.
[20] Improved pulse detection from head motions using DCT.
[21] Heart rate measurement based on a time-lapse image.
[22] Photoplethysmography and its application in clinical physiological measurement.
[23] Robust heart rate measurement from video using select random patches.
[24] RhythmNet: End-to-end heart rate estimation from face via spatial-temporal representation.
[25] EVM-CNN: Real-time contactless heart rate estimation from facial video.
[26] Eulerian video magnification for revealing subtle changes in the world.
[27] Remote heart rate estimation using a transductive meta-learner.
[28] Remote physiological measurement via cross-verified feature disentangling.
[29] Camera-based discomfort detection using multi-channel attention 3D-CNN for hospitalized infants.
[30] Using blood volume pulse vector to extract rPPG signal in infrared spectrum.
[31] Telehealth for global emergencies: Implications for coronavirus disease 2019 (COVID-19).
[32] A heart rate monitoring framework for real-world drivers using remote photoplethysmography.
[33] Predicting heart rate variations of deepfake videos using neural ODE.
[34] Unsupervised skin tissue segmentation for remote photoplethysmography.
[35] VIPL-HR: A multi-modal database for pulse estimation from less-constrained face video.
[36] Non-contact video-based pulse rate measurement on a mobile service robot.
[37] A multimodal database for affect recognition and implicit tagging.
[38] A reproducible study on remote heart rate measurement.
[39] Detection and tracking of point features.
[40] Zero-reference deep curve estimation for low-light image enhancement.
[41] PyTorch: An imperative style, high-performance deep learning library.
[42] Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions.
[43] Meta-rPPG: Remote heart rate estimation using a transductive meta-learner.
[44] Adaptive skin segmentation via feature-based face detection.
[45] Exploiting spatial redundancy of image sensor for motion robust rPPG.