title: Diverse R-PPG: Camera-Based Heart Rate Estimation for Diverse Subject Skin-Tones and Scenes
authors: Chari, Pradyumna; Kabra, Krish; Karinca, Doruk; Lahiri, Soumyarup; Srivastava, Diplav; Kulkarni, Kimaya; Chen, Tianyuan; Cannesson, Maxime; Jalilian, Laleh; Kadambi, Achuta
date: 2020-10-24

Remote vital sign monitoring has risen in prominence over recent years, with an acceleration in clinical development and deployment due to the COVID-19 pandemic. Previous work has demonstrated the feasibility of estimating subject heart rate from facial videos. However, all previous methods exhibit performance that is biased against darker skin tone subjects. In this paper, we present a novel approach to mitigate biases in photoplethysmography performance on darker skin tones by relying on the statistics of imaging physics. In addition to mitigating skin tone bias, we demonstrate that the proposed method mitigates errors due to lighting changes, shadows, and specular highlights. The proposed method not only improves performance for darker skin tones but is the overall top performer on the entire dataset. We report a performance gain of 0.69 beats per minute over the benchmark for dark skin tones and an overall improvement of 0.47 bpm across all skin tones. Assessment of the proposed method is accomplished through the creation of the first telemedicine-focused smartphone camera-based remote vital signs dataset, named the VITAL dataset. A total of 344 videos (~688 minutes) consisting of 43 subjects with diverse skin tones, recorded under three lighting conditions, two activity conditions, and two camera angles, is compiled with corresponding vital sign data including non-invasive continuous blood pressure, heart rate, respiratory rate, and oxygen saturation.

The rapid growth in demand for digital health services and virtual healthcare has necessitated the innovation of remote vital sign technologies for telemedicine applications. During the COVID-19 pandemic, many health systems remotely managed low-risk COVID-19 patients in their homes, monitoring them with home vital sign equipment (i.e. pulse oximeters, thermometers) and telemedicine follow-up visits 1, 2 . Supplying and shipping vital sign monitoring devices to patients is expensive and time consuming, making such a solution nonviable, especially during a global pandemic. Given the ubiquity of smartphones, the ability to capture key vitals unobtrusively and remotely using technology that exploits the built-in cameras of smartphones during a telemedicine visit is of paramount importance. Even outside of remote telemedicine, contactless measurement of vitals can improve the efficiency and safety (in the context of contagious disease) of clinical triage 3, 4 and intensive care unit (ICU) observation [5] [6] [7] [8] . Recent methods have proposed using camera-based hardware in combination with computer vision algorithms and artificial intelligence tools to estimate key vitals, such as heart rate (HR) 9-36 , respiratory rate 4, [16] [17] [18] 32, [37] [38] [39] , blood pressure [40] [41] [42] , oxygen saturation [43] [44] [45] , and temperature 4, 20, 46 , in a completely contactless manner. In the specific context of estimating cardiovascular vitals, previous work has primarily focused on techniques that remotely extract a blood volume pulse (BVP) signal and a corresponding HR estimate.
Remote photoplethysmography (r-PPG) is one of the most promising techniques used to extract a BVP, primarily from the face. The technique operates by looking for subtle color variations visible on the surface of human skin, caused by sub-dermal light absorption fluctuations from changes in blood volume and content. Several r-PPG algorithms have been proposed to extract the BVP signal from videos, including blind source separation (BSS) 12, 24, 25 , model-based 11, 15, 21, 26 , unsupervised data-driven 9, 22 , and supervised deep learning 18, 19, 23, 27, 28, 47 methods. Unfortunately, the performance of existing algorithms fluctuates with changes in illumination condition 30 , subject motion 14, 26, 31 , and skin tone 48 . These key issues suggest that current r-PPG algorithms may be inherently biased: a performance gap exists for certain types of skin tones, subject motions (e.g. speaking) or illumination conditions. Mitigating these biases is challenging. For example, dark skin, which contains higher amounts of melanin, fundamentally reduces the signal to noise ratio of all existing r-PPG algorithms. This is highlighted by the important work of Nowara et al. 48 , which conclusively determined that current r-PPG algorithms have markedly worse performance on darker skin tones. The work also highlights the issue of biased skin tone and gender representation in computer vision datasets, which is especially true for the comparatively small datasets used in r-PPG analyses. This dataset bias further prevents underlying algorithmic biases, such as skin tone bias, from being addressed. Kumar et al. (DistancePPG) 13 first attempted to mitigate skin tone bias using a weighted average of signals from various facial regions-of-interest (ROI). However, to the best of the authors' knowledge, no work has yet continued the development of r-PPG algorithms that tackle the important issue of performance bias on darker skin tones. In this paper, we provide a novel approach to mitigating skin tone bias, using a larger and more diverse dataset. In contrast to prior approaches, the focus of this work is on understanding the unique physics that underlies inconsistency in r-PPG measurement. Using physics-rooted knowledge and camera noise analysis, we propose modifications to existing r-PPG denoising methods that use a weighted ROI philosophy similar to that of DistancePPG. We show that previous methods 33, 36 that extrapolate the weighted ROI philosophy of DistancePPG to more modern pulsatile signal extraction methods actually increase the bias (by providing greater improvement only to lighter skin tones). The proposed method innovates on these weighted ROI denoising approaches by incorporating an RGB-space weighting and a novel skin diffuse weighting into the algorithm, leading to a fairer performance gain across all skin tones, as well as robust performance in varying lighting conditions and during subject speech. To assess the performance of the proposed method with respect to benchmark methods without dataset bias, we collect the first remote vital signs detection dataset focused on telemedicine applications, hereafter the Vital-sign Imaging for Telemedicine AppLications (VITAL) dataset. VITAL is a diverse dataset containing videos of 43 subjects from a range of skin tones, demographics, lighting conditions, in-video activities and camera angles, all captured using consumer cell phones.
Four vital signs, namely the heart rate, respiratory rate, oxygen saturation, and blood pressure, in addition to the ECG and PPG signal waveforms, are measured using a medical grade patient monitor. With the creation of the VITAL dataset, this paper highlights the performance discrepancy of prior weighted ROI denoising techniques on darker skin tones. When testing previous r-PPG methods on VITAL, a more diverse dataset, we observe both a decrease in overall r-PPG performance and a performance gap that favors lighter skin tones. In contrast, the proposed method of Diverse r-PPG is not only the top performer across the VITAL dataset but boosts the performance of r-PPG more on darker skin tones, thereby mitigating the performance gap. Additional improvements are also observed in the challenging conditions that the VITAL dataset introduces, including facial motions (e.g. talking) and various lighting settings and camera angles. Overall, a fairer algorithmic performance is observed, laying a foundation for future work toward fairer remote PPG systems. In order to validate the performance of remote camera-based vital sign detectors, we construct the Vital-sign Imaging for Telemedicine AppLications (VITAL) dataset. The focus of this dataset is to represent diversity in factors that are relevant to telemedicine setups, including: (i) smartphone deployment, (ii) camera view angle, (iii) recording condition diversity (lighting variation and talking), and (iv) patient demographic diversity. We address each of these aspects individually to highlight the extent of diversity in the dataset and how it was achieved: (i) Smartphone deployment: The ubiquity of smartphones globally has led healthcare systems to establish smartphone applications that can be downloaded by patients. Such applications have been used for hosting telemedicine appointments. A deployable remote vital sign estimation solution with a focus on telemedicine must be able to work efficiently on smartphone cameras by considering factors including video compression and algorithmic time and space complexity. Moreover, the solution must achieve success independent of camera type. In order to allow for such testing, the VITAL dataset uses two different smartphone cameras, one for each view angle. Specifically, a Samsung Galaxy S10 and a Samsung Galaxy A51 are used, which have slightly different imaging device characteristics. The use of more than one smartphone imager inspires the development of algorithms that can scale to a variety of device-agnostic telemedicine conditions. (ii) Camera view angle: In a telemedicine setting, there can also be a diversity of camera angles that the algorithm must work on. In order to facilitate this estimation and verification, the VITAL dataset consists of two camera view angles for all the videos of each subject (as seen in Figure 1): one camera is perfectly front-on, while the other is directly in front of the face, at a dip (lower) of 15 degrees. The front-on camera is placed approximately 130 cm from the subject, and the lower camera at a dip is approximately 90 cm from the subject. (iii) Recording condition diversity: Another essential factor involves testing algorithms across a range of recording conditions.
The VITAL dataset consists of four recording conditions: (1) controlled lighting at 5600K ("cool" lighting) with the subject remaining stationary, (2) controlled lighting at 3200K ("warm" lighting) with the subject remaining stationary, (3) ambient room lighting (with distributed white LED lighting) with the subject remaining stationary, and (4) ambient room lighting with the subject speaking. As the background could not easily be varied, a green screen backdrop is kept to potentially enable digital modification of background scenery. The motivation for collecting these varied scene conditions in VITAL was to promote the development of telemedicine algorithms that can operate in the wild. (iv) Patient demographic diversity: The VITAL dataset consists of 43 subjects spread across skin tone, age, gender and demographic diversity. Subject characteristics (gender, age, height, weight, body mass index (BMI), race, and ethnicity) are summarized in Table 1. To benchmark the performance of the proposed method, we compare it against previous remote HR estimation algorithms. We choose the CHROM 11 signal extraction method due to its versatility and the open availability of code 50 . We compare with the two most common categories of algorithmic processing steps, which we refer to as facial aggregation (c.f. 9, 11, 12, 15, 25, 26 ) and SNR weighting (c.f. 13, 33, 34, 36 ). Both these techniques are described in detail in the Methods section. We believe that these two processing regimes encapsulate the major processing philosophies used in existing r-PPG methods. To ensure a fair comparison with the benchmark methods, we implement identical testing conditions across techniques. Hence, for each method, the input video is passed through the same face detection algorithm (convolutional neural network based detector 51 ), following which the eyes and mouth are cropped out using facial feature points 52 . Some methods also use skin segmentation algorithms to remove regions such as eyes and mouth 31, 35 , but we empirically found this to perform slightly worse on the VITAL dataset. We also use a consistent heart rate selection technique for each method, which also consists of a compression artifact suppression step. This is detailed in the Methods section. Figure 2 shows the qualitative performance of the proposed method in comparison to the ground truth PPG and benchmark methods. The estimated pulse volume signal for the proposed method is found to visually contain peaks at the same frequency as the ground truth PPG signal. In some instances, the dicrotic notch is also present, although it is less prominent. Particularly noisy regions of the video are highlighted by the dashed red lines. In these time windows, the proposed method is found to visually recover peaks more distinctly with fewer high frequency artifacts in comparison to the benchmark r-PPG methods. Additionally, Figure 2b shows the beat-to-beat time evolution of the heart rate estimate across the 10 second windows. Both the estimates from the ground truth signal and the output of the proposed method follow similar trends, consistently staying within 5 bpm of each other. However, as a result of the high frequency artifacts in existing methods, the estimated heart rates suffer from large errors in localized regions, worsening the overall heart rate estimate across the 2-minute video.
This emphasizes both the correlation of the estimated heart rate from the proposed method with the ground truth heart rate, as well as the errors that occur with existing methods. In order to quantitatively assess the performance of the proposed method, the following statistical metrics are used: (i) the mean absolute error (MAE), (ii) the standard deviation of the error (SE), and (iii) the correlation coefficient (r) between the estimated r-PPG average heart rate and the ground truth PPG average heart rate for the entire video. Table 2 contains these metrics. In addition, Table 3 contains information about the improvement in the MAE metric for the SNR weighting 13, 33, 34, 36 and proposed methods over the facial aggregation method. We also employ Bland-Altman (B&A) plots to compare differences between the proposed method's heart rate estimates and MX800 PPG heart rate measurements (Figures 3 and 4). These plots are labelled with the corresponding mean difference (m) that shows the systematic bias, and the limits of agreement (LoA) within which 95% of the differences are expected to lie, estimated as LoA = m ± 1.96σ, assuming a normal distribution.
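For reference, a minimal sketch of these evaluation metrics is given below (Python with NumPy; the function name and dictionary layout are illustrative choices of ours, not the authors' evaluation code):

import numpy as np

def agreement_metrics(hr_est, hr_ref):
    """Evaluation metrics described above (sketch): MAE, standard deviation
    of the error (SE), Pearson correlation (r), and Bland-Altman bias with
    95% limits of agreement, assuming normally distributed differences."""
    hr_est = np.asarray(hr_est, dtype=float)
    hr_ref = np.asarray(hr_ref, dtype=float)
    err = hr_est - hr_ref
    mae = np.mean(np.abs(err))
    se = np.std(err, ddof=1)
    r = np.corrcoef(hr_est, hr_ref)[0, 1]
    bias = err.mean()                              # Bland-Altman mean difference m
    loa = (bias - 1.96 * se, bias + 1.96 * se)     # limits of agreement
    return {"MAE": mae, "SE": se, "r": r, "bias": bias, "LoA": loa}

The metrics are computed over per-video average heart rates, with one estimate-reference pair per video.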
Table 1 describes the distribution of subjects across various demographic metrics. Overall, remote heart rate estimation performance was compared across 43 subjects, 4 scene conditions and 2 camera angles, resulting in a total of 344 videos with an average length of 2 minutes. Heart rate estimation is carried out for windows of duration 10 seconds, with an overlap of 5 seconds. The overall heart rate for the subject is then estimated by averaging these window-estimated heart rates. The qualitative accuracy of the proposed method also manifests as accurately estimated heart rates at the window level. Quantitatively, the proposed method shows a sub-6 beats per minute MAE for all skin tones, with an overall average MAE of 3.95 beats per minute. Similar trends are observed for the other metrics as well. As with previous methods, the performance of the proposed method is best for light skin tones and reduces with darker skin tones; however, as compared to the benchmarks, the proposed algorithm boosts the overall performance (across all skin tones) and boosts the performance on darker skin tones by a larger amount. The proposed method does not obviate skin tone bias but rather is the first work that can be demonstrated to mitigate skin tone bias on the VITAL dataset. In terms of scene conditions, the performance is largely uniform across the three lighting conditions, with an average performance of 3.63 beats per minute. There is a sizable reduction in performance for the 'talking' setting. This is expected since talking inherently involves motion of the face, which has a cascade effect on the algorithm. However, the proposed method shows an MAE of 4.87 beats per minute, again the only algorithm within 5 beats per minute for this setting. For all three methods, performance degrades from light to dark skin. The facial aggregation approach observes MAEs of 3.97, 3.80 and 6.45 beats per minute (bpm) for light, medium and dark skin tone subjects, resulting in an overall average performance of 4.41 bpm. The SNR weighting approach 13, 33, 34, 36 shows an improvement of 0.18 bpm in terms of MAE as compared to the facial aggregation benchmark for light skin tones. However, the performance successively degrades as skin tone gets darker: medium skin tones are worse by 0.27 bpm, while dark skin tones are worse by 1.08 bpm, a significant drop. Hence, on a skin tone diverse dataset such as ours, this leads to a decrease in overall performance, with an average performance of 4.70 bpm. This is 0.29 bpm worse on average than facial aggregation. Note the significant drop in the correlation coefficient (r) for dark skin tones, which goes from 0.48 for facial aggregation to 0.31 for SNR weighting. In contrast, the proposed method shows significant improvement across all skin tones: an improvement of 0.32 bpm, 0.48 bpm and 0.69 bpm for light, medium and dark skin tones respectively, in comparison to the facial aggregation benchmark. As a result of these significant improvements across all skin tones, overall performance on the dataset improves to an MAE of 3.95 bpm, an improvement in error performance of 0.46 bpm. Figure 3 highlights the high correlation between the proposed method's r-PPG heart rate estimates and the ground truth PPG heart rate for light (r = 0.83) and medium skin tones (r = 0.84), and moderate correlation for dark skin tones (r = 0.58). The B&A plots in Figure 3d and Figure 3e show that for light and medium skin tones there is a less than 1 bpm bias, and that the proposed method's r-PPG heart rate estimates are almost all within 10 bpm of the ground truth. Figure 3f shows that for dark skin tones the proposed method has a less than 2 bpm bias, and that the estimated heart rates are almost all within 13.3 bpm of the ground truth. These correlation metrics are an improvement over the benchmark methods of facial aggregation and SNR weighting. Each of the three methods performs similarly across the three lighting conditions. The facial aggregation method shows an average MAE of 3.95 bpm across the lighting conditions, while the SNR weighting method 13, 33, 34, 36 shows an average performance of 4.43 bpm. This represents a decrease in performance of 0.48 bpm on average across the three lighting conditions. In contrast to this, the proposed method shows an average performance of 3.63 bpm across the three lighting conditions, as mentioned earlier, representing an improvement of 0.32 bpm MAE. The performance on the 'talking' activity is worse as compared to that on other scene conditions for all three methods. However, in this case, both the SNR weighting method and the proposed method show improvement in performance over the facial aggregation benchmark. The SNR weighting method shows an improvement of 0.30 bpm over the facial aggregation benchmark, whereas the proposed method shows a much larger improvement of 0.93 bpm. In terms of percentage improvements, this equates to an improvement of 5.17% for the SNR weighting method, as compared to a 16.03% improvement for the proposed method. This large performance gain for the proposed method is further reinforced by looking at the correlation coefficient, which improves from 0.54 for the facial aggregation benchmark to 0.71 for the proposed method, as compared to 0.62 for the SNR weighting method. Figure 4 highlights the high correlation between the proposed method's r-PPG heart rate estimates and the ground truth PPG heart rate across the various recording conditions. The dark skin tone markers across all recording conditions make up the majority of outlying data.
The B&A plots in Figures 4e-h show a bias of less than 1 bpm across all recording conditions, with most heart rate estimates of the proposed method being within 10 bpm of the ground truth PPG heart rate. The largest limit of agreement is for the most challenging and noisy recording condition of subject talking. Here, the proposed method's heart rate estimates are largely within 12.2 bpm of the ground truth heart rate. These correlation metrics are an improvement over the benchmark methods of facial aggregation and SNR weighting. The final scene analysis factor is camera viewpoint. As mentioned previously, the VITAL dataset consists of two camera angles: front, where the camera is in front of the face, aimed directly at it, and bottom, where the camera is in front of the face but at a dip, so that the camera looks up at the face. Across the three skin tone categories, the proposed method shows both the best performance, as well as a truly unbiased performance gain on the VITAL dataset. While the SNR weighting method 13, 33, 34, 36 shows a performance gain only for the light skin tone subjects, with a performance drop for the other two skin tones, the proposed method shows increasing improvements in MAE across light, medium and dark skin tones. The fact that the largest improvements are observed for skin tones that are traditionally worse performing attests to the fairness of the method. The proposed method is therefore the only method able to achieve an MAE of less than 6 bpm for all skin tones. These inferences are further reinforced by the large increase in the correlation coefficient: the proposed method sees improvements of 7.79%, 5% and 20.83% for the three skin tones, as opposed to changes of 10.39%, -5% and -35.42% for the SNR weighting method. Hence, in addition to the overall improvement in performance, the proposed method can infer more meaningful and correlated measurements on a video to video basis, across skin tones. Similarly, robust and diverse improvements are also observed in terms of the SE. As a result of the above robustness in performance across diversity, the proposed method achieves an overall average performance of 3.95 bpm MAE across the VITAL dataset, as opposed to 4.41 bpm for the facial aggregation method and 4.70 bpm for the SNR weighting method. The proposed method therefore performs the best and is the only processing approach achieving a sub-4 bpm MAE across the VITAL dataset. Reinforcing this observation, improvements of 8.33% and 9.15% are observed for the proposed method in terms of the r value and SE over the facial aggregation benchmark. In contrast, the SNR weighting method observes a reduction in performance of 4.17% and 3.99% respectively. As the VITAL dataset becomes even more diverse in terms of skin tone representation, we expect this performance to improve further for diverse skin tones. Large improvements in performance are observed for the talking activity. With the proposed method being the only processing method able to obtain sub-5 bpm performance on talking, we see a 16.03% improvement in MAE performance over the facial aggregation benchmark, as compared to the SNR weighting method, which shows a 5.17% improvement in MAE. Similarly, significant improvements of 31.48% for the r value and 16.06% for the SE are observed for the proposed method, as opposed to respective improvements of 14.81% and 7.42% for the SNR weighting method. These improvements in performance are observed across camera viewpoints.
The proposed method shows improvements of 8.43% and 13.67% for the front and bottom angles, respectively. This compares to performance drops for the SNR weighting method. Similar performance gains are observed for the proposed method in terms of the r value and SE across the two camera angles: improvements in (r value, SE) of (13.64%, 9.19%) and (5.13%, 9.02%) for the front and bottom angles, respectively. Interestingly, for all methods tested (existing and novel), the bottom angle shows improved performance as compared to the front angle. This could be because interfering factors such as hair, spectacles and so on occupy a smaller portion of the usable frame in the bottom angle. Overall, the proposed method is found to impart truly unbiased performance gains across skin tones. As a result of the noise-oriented processing, average results and skin tone-specific results for the proposed method are within acceptable medical ranges. With these observations at hand, future work must focus on testing and improving these methods on datasets with other forms of diversity, such as testing across all possible values of heart rate and testing on patients with heart arrhythmias. This will help further analyze and improve the performance of future work across any diverse dataset. Ultimately, we hope this work motivates the community towards exciting and essential research avenues looking into inherent system biases associated with r-PPG. By mitigating biases, we move a step closer towards deployment of non-contact vital sensing techniques that can aid clinicians in delivering remote patient care, during times of peace and pandemic alike. The human study protocol was approved by the UCLA Institutional Review Board (IRB), and participants provided written informed consent to take part in the study. Figure 1 shows the data collection setup. Each subject is seated on a height-adjustable chair, in the field of view of two cell-phone cameras (with different view angles). We record subjects using these cameras under four different scene conditions: (1) controlled lighting at 5600K ("cool" lighting) with the subject remaining stationary, (2) controlled lighting at 3200K ("warm" lighting) with the subject remaining stationary, (3) ambient room lighting with the subject remaining stationary, and (4) ambient room lighting with the subject speaking. While r-PPG based heart rate estimation has been increasingly researched over the past years, certain inherent biases and performance gaps continue to exist across subject and scene conditions. These biases include: (i) subject skin tone, (ii) scene lighting, (iii) shadows and specular highlights (bright regions in an image which are reflections of the light source, rather than transmissions from the skin) and (iv) facial motion due to talking. In what follows, we use first principles to derive potential sources of bias, link biases to statistical noise, and develop novel denoising and debiasing algorithms, whose source code is available. The goal of this subsection is to use light transport theory to show that the appreciable error in r-PPG estimation due to dark skin is not due to biophysical factors, but instead due to imaging noise. Previous work has developed a mathematical model for skin coloration as a function of melanin content and blood volume fraction 53 . Using this model, the biophysical signal-to-interference-plus-noise ratio (SINR) can be shown to remain constant across skin tones while, as can be intuitively understood, the signal strength reduces with increasing skin melanin content. We can therefore infer that the corruption added to the signal, which hinders accurate inference, is not biophysical in nature (observed from the constant biophysical SINR). In contrast, the decreasing signal strength leads us to an analysis of imaging noise, which is the major noise phenomenon at play in this case.
The goal of this subsection is to numerically derive the relationship between imaging noise and r-PPG algorithm estimation. Imaging noise refers to the inherent noise that arises due to the image capture process in a commercial camera. This noise arises from various effects related to photon arrival processes, thermal noise in the electronics, and the quantization noise associated with digitally capturing images 54 . Overall, the signal to noise ratio for a pixel of a particular intensity is given by SNR(p) = p / sqrt(g·p + σ_r^2 + σ_q^2), where p is the pixel value (ranging from 0-255), g is the sensor gain (a constant for a given image), and σ_r and σ_q are the read and quantization camera noise parameters (also constant). Plugging in typical values for the constants, Figure 5b shows the trend of the SNR as a function of pixel value. The SNR is smaller for lower pixel values (corresponding to darker skin or shadowed regions) as compared to higher pixel values (corresponding to brighter skin or lit up regions). These observations, coupled with the observations from the previous subsection, allow us to make the following inferences: (i) Imaging noise creates skin tone bias: the performance gap across skin tones, as well as across lighting differences, can be understood in terms of imaging noise. (ii) Imaging noise and specular reflections degrade the r-PPG signal: since the biophysical SINR is independent of skin tone, the imaging noise, coupled with specular highlights due to lighting, are the major contributing factors to signal degradation. (iii) Robust r-PPG requires principled noise handling: combating the highlighted biases in existing r-PPG algorithms to move towards robust r-PPG would therefore involve a principled approach towards reduction of the above highlighted imaging noise, together with specular highlight removal. (iv) Denoising must be done before signal inference: this noise removal must be carried out in the combination step (defined below) as opposed to after signal averaging. With the above inferences on hand, we look at the performance of existing methods that introduce averaging techniques in the combination step to improve the signal to noise ratio. We first describe in detail a typical remote PPG algorithm pipeline for ease of understanding for the reader. There are four distinct components to this pipeline: (a) detection, which identifies facial regions of interest in the video frame, (b) combination, which condenses the information from the regions of interest into an RGB time series signal, (c) signal inference, which uses the time series signal to estimate the pulse volume waveform, and (d) heart rate estimation, which estimates the heart rate from the pulse volume signal. This pipeline is visually described in Figure 6. The video is first passed through a neural network based face detector 51 , in order to identify the face region in the frame. Using feature point detectors, the eye and mouth regions are identified and explicitly removed from the videos (since these regions do not contribute to the pulsatile signal). This is the detection step. The next steps, namely combination, inference and heart rate estimation, are carried out for smaller video windows of 10 seconds in length with an overlap of 5 seconds. For each video frame, the skin pixels are combined to obtain one RGB sample for that time instance (the method for this combination varies across papers and is the crux of this work's novelty). Across all frames, after this combination, we obtain a time series RGB signal. This is the combination step.
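As a concrete reference, a minimal sketch of this detection-plus-combination interface is given below (Python with NumPy; the 30 fps frame rate, the boolean-mask representation of the detected skin region, and the function names are illustrative assumptions of ours, not details from the paper):

import numpy as np

FS = 30            # assumed integer frame rate (frames per second)
WIN, HOP = 10, 5   # 10-second windows with 5-second overlap

def facial_aggregation(frames, skin_masks):
    """Baseline combination step (sketch): average all detected skin pixels
    in each frame (eyes and mouth already masked out by the detection step)
    into one RGB sample, yielding an (n_frames, 3) time series."""
    return np.stack([frame[mask].mean(axis=0)
                     for frame, mask in zip(frames, skin_masks)])

def windows(rgb_series, fs=FS, win=WIN, hop=HOP):
    """Split the combined RGB time series into overlapping windows for the
    inference and heart rate steps."""
    step, length = hop * fs, win * fs
    for start in range(0, len(rgb_series) - length + 1, step):
        yield rgb_series[start:start + length]

Each yielded window is what the inference step described next operates on.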
These RGB signals are then put through an existing signal inference technique. In this paper, we use the CHROM algorithm 11 due to its versatility, as well as its easy access through openly available code 50 . The output obtained from this step is a pulsatile waveform estimate for each window. This is the inference step. The obtained pulsatile waveform is then processed to arrive at the final heart rate. This is the heart rate step. We first filter the waveform using a 3rd-order Butterworth bandpass filter with pass band frequencies of [0.7, 3.5] Hz. The power spectral density (PSD) is then computed. Temporal frequency artifacts were empirically observed in the original video as a result of aggressive compression, likely due to the unchanging green background. These erroneous peaks were appropriately removed. Next, the five highest peaks in the PSD are chosen. The peak with the highest combined fundamental and second harmonic power is chosen as the one corresponding to the heart rate. The final heart rate for the video is estimated as the average of the heart rate estimates for each 10 second window. We now look at existing algorithms that propose methods to improve the noise performance in the combination step. The most straightforward approach is to simply average all face pixels in a frame, in order to arrive at time samples of the RGB signal. We refer to this as facial aggregation. To improve upon this, previous approaches have sought to modify this averaging process. We describe the best performing result amongst these on the VITAL dataset. The face is gridded into smaller rectangular regions. Pixels within each region are averaged to arrive at individual time series for each region. Each of these gridded temporal signals is passed through the inference step, to obtain the corresponding blood volume signal estimate. Approaches use measures such as the SNR at the peak frequency of this blood volume signal to characterize the 'goodness' of each signal 13 , with higher weights being assigned to better signals. As mentioned previously, in this paper we use the two-harmonic SNR estimate, which was found to be more robust. That is, for a signal s (with frequency domain representation S) and heart rate frequency p, the SNR at the heart rate frequency is the ratio of the spectral power within a window of width w around p and around its second harmonic 2p, to the spectral power in the remainder of the passband: SNR = ( Σ_{|f - p| ≤ w} |S(f)|^2 + Σ_{|f - 2p| ≤ w} |S(f)|^2 ) / Σ_{remaining in-band f} |S(f)|^2 , where w is the peak window size for estimation (for this work's experiments, we use w = 0.1 Hz). These weights are then used to average the blood volume signals together, and the resultant signal is passed to the heart rate step. We call this method SNR weighting 13, 33, 34, 36 . Tables 2 and 3 show the results for facial aggregation and SNR weighting. As can be seen, the method affords an improvement for light skin tones, but shows a stark reduction in performance for medium and darker skin tones. This performance reduction can be understood in terms of the weight maps. The weight maps from previous methods (based on region-based SNR estimates) have the tendency to be sparse, especially for darker skin tones. As a result, the improvements due to weighted averaging are lost to noise corruption for darker skin tone subjects, since much less signal is being aggregated. Datasets on which these previous methods were tested were not as diverse across skin tones: these performance caveats were therefore missed.
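For concreteness, a minimal sketch of the inference, weighting and heart rate pieces described above is given below (Python with NumPy/SciPy). The chrom function is a simplified fixed-coefficient version of the chrominance method, not the authors' released code, and the frame rate FS, the periodogram-based PSD, and the simple top-five peak search are our own illustrative choices:

import numpy as np
from scipy.signal import butter, filtfilt, periodogram

FS = 30.0          # assumed video frame rate (fps)
BAND = (0.7, 3.5)  # heart rate passband in Hz
W = 0.1            # peak window size in Hz, as in the text

def bandpass(x, fs=FS, band=BAND, order=3):
    b, a = butter(order, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def chrom(rgb, fs=FS):
    """Simplified chrominance-based inference step: rgb is an
    (n_frames, 3) window of combined skin-pixel values."""
    norm = rgb / rgb.mean(axis=0)                  # temporal normalization
    xs = 3 * norm[:, 0] - 2 * norm[:, 1]
    ys = 1.5 * norm[:, 0] + norm[:, 1] - 1.5 * norm[:, 2]
    xf, yf = bandpass(xs, fs), bandpass(ys, fs)
    return xf - (xf.std() / yf.std()) * yf         # pulse waveform estimate

def two_harmonic_snr(pulse, hr_freq, fs=FS, w=W):
    """Power near the HR frequency and its 2nd harmonic, relative to the
    remaining in-band power (used as a region 'goodness' weight)."""
    f, pxx = periodogram(pulse, fs=fs)
    band = (f >= BAND[0]) & (f <= 2 * BAND[1])
    sig = band & ((np.abs(f - hr_freq) <= w) | (np.abs(f - 2 * hr_freq) <= w))
    return pxx[sig].sum() / max(pxx[band & ~sig].sum(), 1e-12)

def estimate_hr(pulse, fs=FS):
    """Heart rate step (sketch): PSD of the band-passed waveform, five
    largest in-band peaks as candidates, then the candidate with the
    largest combined fundamental + second-harmonic power. Compression
    artifact bins are assumed to have been suppressed beforehand."""
    f, pxx = periodogram(bandpass(pulse, fs), fs=fs)
    in_band = np.where((f >= BAND[0]) & (f <= BAND[1]))[0]
    candidates = in_band[np.argsort(pxx[in_band])[-5:]]
    power = lambda f0: pxx[np.abs(f - f0) <= W].sum() + pxx[np.abs(f - 2 * f0) <= W].sum()
    best = max(candidates, key=lambda i: power(f[i]))
    return 60.0 * f[best]

The per-video heart rate is then the average of estimate_hr over the overlapping 10 second windows.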
Additionally, the previous method of spatial gridding may also fall prey to specular highlights. Specular highlights are regions of the face where the light from the source is directly reflected from the skin to the camera. While the signal intensity is high in these regions, the signal contains no information about the pulsatile signal, which gets buried in the light from the source. Previous weighting approaches do not take this into account. This is a considerable factor when looking at scene conditions such as camera angle, lighting direction, lighting color and intensity, as well as skin tone (since specular highlights affect darker skin more than lighter skin). Having identified the reasons for the poor performance of existing methods, we propose novelties to be incorporated in the combination step that aim to achieve a performance gain in a manner that is fair across skin tones. Specifically, there are two major novelties that we propose: (i) weighting in RGB space rather than in blood volume signal space, and (ii) skin diffuse component weighting. We now describe each of these steps in detail. (a) RGB-space weighting: Existing spatial averaging methods estimate weights for each grid region based on the blood volume signal quality 33, 34, 36 . Instead of using these estimated weights to average the blood volume signals, as done in previous methods, we propose using these weights to average in RGB space. As a result, we obtain one consolidated SNR-weighted RGB signal, which is again passed through the inference step to obtain the final blood volume signal. The motivation for this modification can be understood in the context of noise. Averaging the RGB signal before passing it through the inference step results in a less noisy signal entering the inference method, enabling the inference method to provide better estimates, as compared to when noisier signals are passed through the method and averaged later. If the inference method is non-linear (such as CHROM 11 ), pre-weighting leads to an additional noise performance gain. (b) Skin diffuse component weighting: The second novelty weights each grid region by its estimated skin diffuse reflection component, so that regions dominated by specular highlights are down-weighted. The diffuse weights play two key roles in improving bias in performance as well as overall performance: first, they can, for the first time, remove specular-affected regions from the average. Second, they combat the sparsity issue observed in traditional SNR weights, since the diffuse component is continuous. The SNR weights and the novel diffuse weights are multiplied together and renormalized to arrive at the final spatial weights for the gridded video. The overall pipeline, therefore, involves using the novel weights together to arrive at efficiently weighted RGB signals. These are averaged together and passed through the inference step and heart rate step. This pipeline is visually highlighted in Figure 6.
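A minimal sketch of this proposed combination step follows (Python with NumPy). It assumes that per-region RGB traces, per-region SNR weights (for example from two_harmonic_snr above applied to a first-pass per-region pulse estimate) and per-region diffuse weights have already been computed; the function name and array layout are illustrative, not the released implementation:

import numpy as np

def combine_rgb(region_rgb, snr_weights, diffuse_weights):
    """Proposed combination step (sketch): fuse gridded RGB traces into a
    single RGB time series using SNR x diffuse weights, before the pulse
    inference step (e.g. CHROM) is applied.

    region_rgb:      array of shape (n_regions, n_frames, 3)
    snr_weights:     array of shape (n_regions,)
    diffuse_weights: array of shape (n_regions,)
    """
    w = np.asarray(snr_weights, float) * np.asarray(diffuse_weights, float)
    w = w / max(w.sum(), 1e-12)                       # renormalize spatial weights
    # Weighted average in RGB space (not in pulse-signal space).
    return np.tensordot(w, region_rgb, axes=(0, 0))   # -> (n_frames, 3)

The combined (n_frames, 3) trace is then passed through the inference and heart rate steps exactly as in the benchmark pipeline.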
The data that has been used to support the findings of this study is available from the corresponding author upon request and adherence to IRB protocols. The code/software that has been used to support the findings of this study is available from the corresponding author upon request and adherence to IRB protocols.

P.C., K. Kabra, and A.K. conceptualized the overall design of the algorithm. P.C., D.S. and T.C. worked on the detection and combination steps of the proposed algorithm. D.K. and K. Kulkarni worked on the heart rate estimation step. D.K. implemented the comparison benchmarks. P.C. and A.K. conceptualized the theory, which P.C. derived and simulated. L.J. and A.K. initiated the IRB for data collection. P.C., K. Kabra and L.J. worked on organizing the collection and storage of data. P.C., K. Kabra, S.L., M.C., L.J., and A.K. wrote the manuscript. M.C., L.J. and A.K. conceptualized the study. A.K. oversaw the project.

Figure 1 (caption): LEDs are used for ambient illumination. The Philips IntelliVue MX800 patient monitor is utilized for ground truth vital sign monitoring. Two smartphone cameras at differing viewing angles capture video of the subject. c. Example frame from video captured by the smartphone camera. The subject wears a blood pressure cuff, 5 ECG leads, and a finger pulse oximeter, which is connected to the MX800 unit.

Figure 3 (caption, partial): The performance gap between skin tones is still present; however, it is reduced in comparison to previous works (see Table 2). a-c. Scatter plots for different skin types. The proposed method shows moderate to strong correlation with respect to ground truth heart rates from the MX800, denoted by the Pearson correlation coefficient.

Caption fragment: ... weighting in RGB space, to achieve robust r-PPG performance across skin tones.

References (titles only):
Rapid implementation of a COVID-19 remote patient monitoring program
Leveraging health system telehealth and informatics infrastructure to create a continuum of services for COVID-19 screening, testing, and treatment
Remote monitoring system of vital signs for triage and detection of anomalous patient states in the emergency room
Contactless Vital Signs Measurement System Using RGB-Thermal Image Sensors and Its Clinical Screening Test on Patients with Seasonal Influenza
Remote vital parameter monitoring in neonatology - robust, unobtrusive heart rate detection in a realistic clinical scenario
Remote Photoplethysmographic Assessment of the Peripheral Circulation in Critical Care Patients Recovering From Cardiac Surgery
Intensive care telemedicine: evaluating a model for proactive remote monitoring and intervention in the critical care setting. Stud Health Technol Inform
Robust Pulse Rate From Chrominance-Based r-PPG
Non-contact, automated cardiac pulse measurements using video imaging and blind source separation
Robust non-contact vital signs monitoring using a camera
Motion robust PPG-imaging through color channel mapping
Algorithmic Principles of Remote PPG
Video-Based Physiologic Monitoring During an Acute Hypoxic Challenge: Heart Rate, Respiratory Rate, and Oxygen Saturation
Simple Neonatal Monitoring by Photoplethysmography
DeepPhys: Video-Based Physiological Measurement Using Convolutional Attention Networks
Remote Heart Rate Measurement From Highly Compressed Facial Videos: An End-to-End Deep Learning Solution With Video Enhancement (IEEE/CVF International Conference on Computer Vision, ICCV)
Vital sign monitoring utilizing Eulerian video magnification and thermography
New insights on super-high resolution for video-based heart rate estimation with a semi-blind source separation method
Self-Adaptive Matrix Completion for Heart Rate Estimation from Face Videos under Realistic Conditions
Remote Photoplethysmograph Signal Measurement from Facial Videos Using Spatio-Temporal Networks
Constrained independent component analysis approach to nonobtrusive pulse rate measurements
Measuring pulse rate with a webcam - A non-contact method for evaluating cardiac activity
Improved motion robustness of remote-PPG by using the blood volume pulse signature
End-to-End Heart Rate Estimation From Face via Spatial-Temporal Representation
The Benefit of Distraction: Denoising Remote Vitals Measurements using Inverse Attention
Towards Driver Monitoring Using Camera-Based Vital Signs Estimation in Near-Infrared
Remote Heart Rate Measurement from Face Videos under Realistic Situations
Exploiting Spatial Redundancy of Image Sensor for Motion Robust r-PPG
Remote plethysmographic imaging using ambient light
Block-based adaptive ROI for remote photoplethysmography
Unsupervised skin tissue segmentation for remote photoplethysmography
Non-contact Heart Rate Monitoring by Combining Convolutional Neural Network Skin Detection and Remote Photoplethysmography via a Low-Cost Camera (IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW)
Model-based Region of Interest Segmentation for Remote Photoplethysmography
Non-Contact Monitoring of Breathing Pattern and Respiratory Rate via RGB Signal Measurement
Remote Respiratory Monitoring in the Time of COVID-19
Robust respiration detection from remote photoplethysmography
Introducing Contactless Blood Pressure Assessment Using a High Speed Video Camera
CamBP: a camera-based, non-contact blood pressure monitor
MobiEye: turning your smartphones into a ubiquitous unobtrusive vital sign monitoring system
Camera-based pulse-oximetry - validated risks and opportunities from theoretical analysis
Noncontact Monitoring of Blood Oxygen Saturation Using Camera and Dual-Wavelength Imaging System
Calibration of Contactless Pulse Oximetry
Remote sensing of multiple vital signs using a CMOS camera-equipped infrared thermography system and its clinical application in rapidly screening patients with suspected infectious diseases
Visual Heart Rate Estimation with Convolutional Neural Network
A Meta-Analysis of the Impact of Skin Tone and Gender on Non-Contact Photoplethysmography Measurements
The Validity and Practicality of Sun-Reactive Skin Types I Through VI
An Open Non-Contact Imaging-Based Physiological Measurement Toolbox
Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks
One millisecond face alignment with an ensemble of regression trees
A Biophysical 3D Morphable Model of Face Appearance
Noise-optimal capture for high dynamic range photography
Real-Time Specular Highlight Removal Using Bilateral Filtering