key: cord-0258476-s4ear472 authors: Hirten, R. P.; Tomalin, L.; Danieletto, M.; Golden, E.; Zweig, M.; Kaur, S.; Helmus, D.; Biello, A.; Pyzik, R.; Bottinger, E. P.; Keefer, L.; Charney, D.; Nadkarni, G.; Suarez-Farinas, M.; Fayad, Z. A. title: Evaluation of a Machine Learning Approach Utilizing Wearable Data for Prediction of SARS-CoV-2 Infection in Healthcare Workers date: 2021-11-05 journal: nan DOI: 10.1101/2021.11.04.21265931 sha: a62e1749abb90ca372607cd708a6604827f3ab85 doc_id: 258476 cord_uid: s4ear472

Importance: Passive and non-invasive identification of SARS-CoV-2 infection remains a challenge. Widespread use of wearable devices represents an opportunity to leverage physiological metrics and fill this knowledge gap.
Objective: To determine whether a machine learning model can detect SARS-CoV-2 infection from physiological metrics collected from wearable devices.
Design: A multicenter observational study enrolling health care workers with remote follow-up.
Setting: Seven hospitals from the Mount Sinai Health System in New York City.
Participants: Eligible participants were health care workers who were 18 years of age or older, were employees of one of the participating hospitals, had at least an iPhone series 6, and were willing to wear an Apple Watch Series 4 or higher. We excluded participants with underlying autoimmune/inflammatory diseases and those taking medications known to interfere with autonomic function. We enrolled participants between April 29th, 2020, and March 2nd, 2021, and followed them for a median of 73 days (range, 3-253 days). Participants provided patient-reported outcome measures through a custom smartphone application and wore an Apple Watch, which collected heart rate variability and heart rate data, throughout the follow-up period.
Exposure: Participants were exposed to SARS-CoV-2 infection over time due to ongoing community spread.
Main Outcome and Measure: The primary outcome was SARS-CoV-2 infection, defined as a self-reported positive SARS-CoV-2 nasal PCR test.
Results: We enrolled 407 participants, 49 (12%) of whom had a positive SARS-CoV-2 test during follow-up. We examined five machine-learning approaches and found that gradient-boosting machines (GBM) had the most favorable 10-fold cross-validation (10-CV) performance. Across all testing sets, our GBM model predicted SARS-CoV-2 infection with an average area under the receiver operating characteristic curve (auROC) of 85% (confidence interval [CI] 83-88%). The model was calibrated to improve sensitivity over specificity, achieving an average sensitivity of 76% (CI ±~4%) and specificity of 84% (CI ±~0.4%). The most important predictors included parameters describing the circadian HRV mean (MESOR) and peak-timing (acrophase), and age.
Conclusions and Relevance: We show that a tree-based ML algorithm applied to physiological metrics passively collected from a wearable device can identify and predict SARS-CoV-2 infection. Utilizing physiological metrics from wearable devices may improve screening methods and infection tracking.

Infection prediction traditionally relies on the development of characteristic symptomatology, prompting confirmatory diagnostic testing. However, SARS-CoV-2 infection poses a challenge to this traditional paradigm given its variable symptomatology, prolonged incubation period, high rate of asymptomatic infection, and variable access to testing. 1, 2 Ongoing case surges throughout the world, prompted by the delta variant, are characterized by greater infectivity and raise the possibility that SARS-CoV-2 may become endemic. While highly effective vaccines against SARS-CoV-2 have been developed, limited vaccine supplies, low vaccination rates in some communities, and the evolution of variants have prompted ongoing infectious spread. 3 Novel means to identify and predict SARS-CoV-2 infection are needed.
Wearable devices are commonly used and can measure multi-modal continuous data throughout daily life. 4 Increasingly, they have been applied in health and disease. 5 Researchers have previously demonstrated that the addition of wearable sensor data to symptom tracking apps can increase the ability to identify coronavirus disease 2019 (COVID-19) patients. 6 Additionally, the combination of heart rate, activity, and sleep metrics measured from wearable devices was able to identify 63% of COVID-19 cases before symptoms, further demonstrating the promise of this approach. 6, 7 Our group launched the Warrior Watch Study, which employed a custom smartphone app to remotely monitor health care workers (HCWs) throughout the Mount Sinai Health System. 8 This app delivered surveys to the subjects' iPhones and enabled passive collection of Apple Watch data. We previously demonstrated that significant changes in heart rate variability (HRV), the small differences in time between each heartbeat that reflect autonomic nervous system (ANS) function, collected from the Apple Watch, occurred up to 7 days before a COVID-19 diagnosis. 8, 9 Building on these observations, our primary aim was to determine the feasibility of training and validating machine learning approaches that combine HRV measurements with resting heart rate (RHR) metrics to predict COVID-19 before diagnosis via nasal polymerase chain reaction (PCR).

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted November 5, 2021. https://doi.org/10.1101/2021.11.04.21265931

We recruited HCWs for this prospective observational study from seven hospitals in New York City (The Mount Sinai Hospital, Morningside Hospital, Mount Sinai West, Mount Sinai Beth Israel, Mount Sinai Queens, New York Eye and Ear Infirmary, Mount Sinai Brooklyn).
8 Subjects were ≥18 years old, employees at one of these hospitals, had at least an iPhone series 6, and were willing to wear an Apple Watch Series 4 or higher. Underlying autoimmune or inflammatory diseases, as well as medications known to interfere with ANS function, were exclusionary. The study was approved by the Mount Sinai Hospital Institutional Review Board, and all subjects provided informed consent prior to enrollment. Subjects downloaded the Warrior Watch Study app, signed the electronic consent, and completed baseline demographic questionnaires. Prior COVID-19 diagnosis, medical history, and occupation classification within the hospital were collected via in-app assessments. Subjects completed daily surveys to report any COVID-19 related symptoms, symptom severity, the results of any SARS-CoV-2 nasal PCR tests, and SARS-CoV-2 antibody test results. A positive diagnosis was defined as a self-reported positive SARS-CoV-2 nasal PCR test. Subjects were asked to wear the Apple Watch for at least 8 hours per day (Figure 1a).

Subjects wore an Apple Watch Series 4 or higher, a commercially available wearable device that connects via Bluetooth to the participant's iPhone. The Apple Watch uses infrared and visible-light light-emitting diodes and photodiodes that act as a photoplethysmogram, generating time series peaks from each heartbeat. 10 There is a moving average window during which heart rate measurements are calculated while the device is worn. HRV is automatically calculated in ultra-short 60-second recording periods as the standard deviation of the inter-beat interval of normal sinus beats (SDNN), a time-domain index. 9 SDNN reflects sympathetic and parasympathetic nervous system activity. The Warrior Watch Study app collects the generated SDNN and heart rate measurements at survey completion.

Our primary analysis consisted of measurements of HRV. HRV follows a circadian pattern that can be characterized by three parameters, namely the MESOR (M: the mean HRV during the day), amplitude (A: maximum HRV during the day), and the acrophase (Ψ: describing when the maximum occurs). 8 We previously developed a mixed-effects COSINOR model to compare HRV circadian patterns at the group level and showed that changes in those parameters were associated with infection. 8 Given these findings, daily measurements of HRV were incorporated as potential diagnostic biomarkers for our machine-learning approach. HRV measurements for each day were sparse and were not taken at regular intervals. Thus, daily estimates of the HRV COSINOR parameters M, A, and Ψ could not be calculated from a single day of data. Due to this limitation, we estimated the daily HRV parameters for each subject and day (t_n) using HRV data from a seven-day sliding window (t_n to t_{n-6}), thereby creating daily smoothed estimates reflecting changes in the last 7 days (Figure 1b). To aid the optimization procedures, each subject's initial estimates were obtained by fitting the first two weeks of that subject's data to a mixed-effects COSINOR model with A, M, and Ψ as random effects. 8 From this model, the subject-specific baseline A, M, and Ψ are derived and used to initialize the iterative 7-day smoothed estimates within each patient. If the number of days with data in the 7-day window was <3, the window was expanded to 14 days (t_{n-14}). In rare cases, no data were available over 14 days, and parameters were imputed using the Last Observation Carried Forward (LOCF) imputation method. During each window, we also measured the maximum, minimum, mean, and standard deviation (SD) of the RHR.
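As a rough illustration of the windowed estimation described above, the sketch below fits a single-component cosinor curve, y = M + A·cos(2πt/24 + φ), by ordinary least squares to the samples that fall in a trailing window. It is a simplification and not the mixed-effects COSINOR model used in the study. The function names, the 24-hour period, the minimum of three samples per fit, the exact window boundaries, and the sign convention for the phase φ (returned in radians) are all assumptions made for the example.

```python
import math


def solve3(a, b):
    """Solve the 3x3 linear system a @ x = b by Gauss-Jordan elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_cosinor(hours, values, period=24.0):
    """Least-squares fit of y = M + A*cos(omega*t + phi); returns (M, A, phi)."""
    omega = 2.0 * math.pi / period
    rows = [(1.0, math.cos(omega * t), math.sin(omega * t)) for t in hours]
    # Normal equations (X'X) coef = X'y for the linearised model
    # y = M + beta*cos(omega*t) + gamma*sin(omega*t).
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, beta, gamma = solve3(xtx, xty)
    # beta = A*cos(phi) and gamma = -A*sin(phi).
    return mesor, math.hypot(beta, gamma), math.atan2(-gamma, beta)


def windowed_cosinor(samples, day, width=7, min_days=3, wide=14):
    """samples: iterable of (day_index, hour_of_day, sdnn) for one subject.

    Tries the trailing `width`-day window first, then the `wide` window.
    Returns None when neither window can be fitted, in which case a caller
    would carry the previous day's estimate forward (LOCF).
    """
    samples = list(samples)
    for w, need_days in ((width, min_days), (wide, 1)):
        window = [s for s in samples if day - w < s[0] <= day]
        # Assumption: at least three samples are required to fit three parameters.
        if len({s[0] for s in window}) >= need_days and len(window) >= 3:
            return fit_cosinor([s[1] for s in window], [s[2] for s in window])
    return None
```

A window in which every sample was recorded at the same clock time makes the system singular and is not handled here; production code would need to guard against that case.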
For each day and subject, there were a total of 8 digital biomarkers used to develop our predictive models (HRV-amplitude, HRV-MESOR, HRV-acrophase, daily RHR, RHR-max, RHR-min, RHR-sd, RHR-mean) and 3 demographic variables known to impact HRV: BMI, age, and gender. 11

It is made available under a CC-BY-NC-ND 4.0 International license.

Data were split into independent training and testing sets, ensuring that observations taken on chronologically similar days (e.g., Day 6 and Day 7) for the same subject were in the same set. A sampling procedure was employed that ensured that observations with proximity in time (±4 days), for the same subject, did not appear in both training and testing sets. This procedure created 100 training and testing sets, containing 90% and 10% of the data respectively. Care was also taken to ensure that the prevalence of COVID-19 positive (COVID+) diagnoses in each set was similar to the prevalence in the full data set.

Machine learning model training and evaluation were performed using the caret and pROC packages, with tuning parameters estimated using 10-fold cross-validation (10-CV). To safeguard against biases induced by the low prevalence of COVID+ samples, we considered several sampling methods to balance the data during 10-CV, ultimately using SMOTE (synthetic minority over-sampling technique), which had the most favorable 10-CV performance. 12 Models were trained on each of the 100 training sets, and their performance (auROC, partial-auROC, accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, balanced accuracy) was assessed on the corresponding testing set and presented as mean with 95% CI.
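The time-gap constraint on the splits described above can be sketched as follows. The exact sampling procedure is not spelled out in the text, so this is only one possible reading: it draws about 10% of the observations at random for testing and then discards any candidate training observation from the same subject that lies within 4 days of a test observation. The function name and arguments are hypothetical, and the sketch does not attempt to match COVID+ prevalence across sets.

```python
import random


def split_with_gap(rows, test_frac=0.10, gap=4, seed=0):
    """rows: list of (subject_id, day). Returns (train_indices, test_indices).

    No training row lies within `gap` days of a test row from the same subject.
    """
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = max(1, round(test_frac * len(rows)))
    test = sorted(idx[:n_test])

    # Days held out for testing, grouped by subject.
    test_days = {}
    for i in test:
        subject, day = rows[i]
        test_days.setdefault(subject, []).append(day)

    train = []
    for i in sorted(idx[n_test:]):
        subject, day = rows[i]
        # Keep the row only if it is more than `gap` days from every test day.
        if all(abs(day - d) > gap for d in test_days.get(subject, [])):
            train.append(i)
    return train, test
```

Calling the function with seeds 0 to 99 would give 100 different splits. Because single days are sampled, many neighbouring training rows are discarded and the training share falls well below 90%; sampling contiguous blocks of days per subject would waste fewer observations.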
The sensitivity of the diagnostic algorithm was prioritized since the application of wearable devices as a non-invasive screening modality would be to prompt a confirmatory PCR test. Our models were trained to maximize partial-auROC (sensitivity boundary of >75%), with tuning parameters estimated using 10-fold cross-validation (CV). When exploring the training data, 10-CV performance was assessed for several different machine-learning algorithms (gradient-boosting machines, elastic-net, partial least squares, support vector machines, and random forests). The gradient boosting machine (GBM) model was selected as the best performing and was used to develop our statistical classifier (Supplementary Table 1). When calibrating the model, the 10-CV predictions were used to optimize the probability threshold such that the sensitivity was >98%. The average value of this probability threshold, over all 100 iterations, was then used to define the final decision rule, where cases with a predicted probability above this threshold were considered COVID+. We used a previously described method to estimate each feature's relative influence/importance in the model, over all 100 training sets. 13

Four hundred and seven HCWs were enrolled between April 29th, 2020, and March 2nd, 2021 (Table 1). The mean age of participants at enrollment was 38 years (SD 9.8), and 34.2% were men. A positive SARS-CoV-2 nasal PCR was reported by 12.0% (49/407) of participants during follow-up (Figure 1c). The median follow-up time was 73 days (range, 3-253 days), for a total of 28,528 days of observations. A median of 4 HRV samples was collected at varying times per participant per day, along with daily measures of RHR.
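The threshold calibration described in the methods above can be illustrated with a short sketch: given cross-validated predicted probabilities and true labels, it picks the largest cutoff that still reaches a target sensitivity. The function names are hypothetical, the rule "positive when probability >= threshold" is an assumption (the text says "above"), and the sketch reaches the target with "at least" rather than "strictly greater than".

```python
import math


def threshold_for_sensitivity(probs, labels, target=0.98):
    """Largest cutoff t such that calling p >= t positive gives sensitivity >= target."""
    positives = sorted((p for p, y in zip(probs, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no positive examples to calibrate on")
    # Number of true positives that must score at or above the cutoff.
    needed = math.ceil(target * len(positives))
    return positives[needed - 1]


def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity of the rule 'positive when p >= threshold'."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg
```

In a repeated-split design like the one described here, a cutoff would be computed on each training set's cross-validated predictions and the cutoffs averaged before being applied to the held-out data. Lowering the cutoff trades specificity for sensitivity, which is the intended behaviour for a screening tool.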
Given the low prevalence of COVID+ observations (<1% of all daily observations were COVID+), and to avoid biased performance metrics resulting from a single split, the data were split into 100 training (including ~90% of the data) and testing (~10%) sets, using a strategy that guarantees independence between testing and training sets. This procedure produced robust estimates of the model performance in the testing set as well as 95% CI (Figure 1D). The 10-CV performance of several different machine-learning methods was explored (Supplementary Table 1). This analysis revealed that the non-linear methods (AUC>98%), GBM and random forest (RF), outperformed all linear methods (AUC<66%), suggesting a non-linear relationship between HRV and SARS-CoV-2 infection. Although the RF model had a higher AUC than GBM, the training performance of RF was consistently high and demonstrated a larger discrepancy between training and 10-CV sensitivity (Supplementary Figure 1). Given this overfitting of the data, we used GBM to develop our final classifier. ROC curves calculated for GBM using all training and 10-CV samples show a high AUC (>98%) (Figure 2a-b). The sensitivity and specificity were >95% in the training data (Supplementary Table 2). However, the sensitivity in 10-CV was comparatively lower (~91%) (Figure 2a), likely because the low prevalence of COVID+ examples in the training data biases the model toward predicting COVID- outcomes, suggesting that sensitivity estimates in the training data are an overestimate. We calibrated the final decision rule to guarantee high sensitivity, as a wearable device-based algorithm would be utilized for screening (Table 2).
This calibrated decision rule increases the true positive rate by allowing for a larger rate of false-positive results. To keep the testing performance unbiased, we used the training data to optimize the decision rule to guarantee a sensitivity >98% (Figure 2c). This optimal decision rule was 0.19 (Figure 2c) and produced an average 10-CV accuracy (Figure 2d) of 84% (CI ±~0.3%), with 98% sensitivity and 84% specificity, thus indicating a specificity loss of 12% for a 7% gain in sensitivity compared to the standard 0.5 decision threshold. When the calibrated diagnostic rule was applied to testing data, an AUC >85% (Figure 2d-e) was achieved (median = 88%). Accuracy and specificity were 84% (CI ±~0.4%) (Figure 2d). The mean sensitivity was 76% (CI ±~4%). The four most important/influential predictors were HRV acrophase, HRV MESOR, age, and BMI (Figure 2f), with median importance >75%. RHR metrics (maximum, minimum, SD, mean), as well as HRV amplitude, were less influential (median importance 25-50%). Gender had importance equal to 0 in most models. To visualize the relationship between feature values and model prediction, we selected the 9 patients for which the model was best able to predict COVID-19 (AUC>98.9% 10-CV), and plotted the acrophase, amplitude, MESOR, and max RHR, as well as the predicted probability, for each day (Figure 3). This analysis revealed a complex relationship between HRV parameters and SARS-CoV-2 infection. It was notable that, for most subjects, the predicted probability increased when HRV amplitude decreased, which is consistent with our previously published analysis. 8

Our results demonstrate that a machine learning approach applied to the physiological metrics measured by a wearable device reliably identifies and predicts SARS-CoV-2 infections. This highlights the potential utility of assessing individual changes in passively collected physiological data from wearable devices to facilitate the management of the COVID-19 pandemic. Infections alter physiological metrics, differentiating infected and uninfected states. Changes in vital signs in the setting of infection, including increased heart rate, elevated respiratory rate, and altered body temperature, have been well described. 16, 17 In addition to these traditional physiological metrics, ANS function, measured by HRV, is altered during illness. Several small studies have shown that changes in HRV can identify and predict infections. 18, 19 Building on these observations and the growing capabilities of wearable technology, wearable devices have been increasingly explored in the setting of infection. They provide a unique means to measure physiological parameters and offer an advantage over periodic assessments in the clinical setting by collecting real-time continuous measurements. 20 This approach can identify trends in individual physiological outputs. On a population level, retrospective analysis of physical activity and heart rate data collected from Fitbits was shown to improve influenza-like illness predictions. 21 This approach, applied at an individual level, was explored during the COVID-19 pandemic. SARS-CoV-2 alters physiological metrics commonly measured by wearable devices. 22 Quer and colleagues collected symptom data and physiological metrics from smartwatches. They found that while resting heart rate alone could not discriminate SARS-CoV-2 infections from negative cases (AUC of 0.52), when it was combined with sleep, activity, and symptom-based data, the AUC increased to 0.80. They demonstrated that the addition of wearable-based data significantly improved the ability of symptoms alone to discriminate between those positive or negative for COVID-19.
6 Similarly, Mishra et al. demonstrated that heart rate, physical activity, and sleep time collected from wearable devices could detect COVID-19. They found that 26 of the 32 COVID-19 positive subjects in their cohort had significant alterations of these metrics before diagnosis or symptom development and that 63% of cases could be detected before symptom onset. 7 Our group previously demonstrated that changes in the circadian pattern of HRV were associated with a COVID-19 diagnosis. 8 We demonstrated that significant changes, particularly in the amplitude of SDNN, were observed over the 7 days before diagnosis. Based on this observation, we built a machine learning algorithm that incorporated HRV circadian rhythm, RHR parameters, and demographic characteristics that can easily be collected from wearable device users. We trained a predictive model and then demonstrated the ability to accurately predict COVID-19 status in new data with relatively high sensitivity (76%) and specificity (84%), compared to the current gold standard of SARS-CoV-2 nasal PCR testing. The model's high sensitivity and the minimal demographic data it requires make it easy to deploy. Our model has an advantage over prior publications evaluating the relationship between wearable-based data and a COVID-19 diagnosis, in that we trained a predictive model and then demonstrated its accuracy in predicting COVID-19 status in new data. [6] [7] [8]

There are several limitations to our study. First, HRV was collected sporadically by the Apple Watch. We employed statistical modeling to account for this. However, a denser data set using continuous data would likely further improve our predictions.
Second, the model we employed used a 7-day smoothing approach. This approach detects infection-induced changes in HRV later than a single-day estimate of HRV would. Thus, the approach we employed is conservative. An additional limitation is that the Apple Watch provides HRV measurements only as SDNN, a time-domain index. This limits assessment of the relationship between other types of HRV measurements and COVID-19 outcomes. Additionally, other factors might impact HRV, which we were not able to capture and control for in the analysis. Furthermore, we were not routinely checking for SARS-CoV-2 infections and relied on subjects reporting a COVID-19 diagnosis. Therefore, infections could have occurred that are not accounted for. Lastly, we did not externally validate our machine learning algorithm in another cohort.

We demonstrate that a machine learning algorithm combining circadian features of HRV with features of resting heart rate derived from the Apple Watch achieves high sensitivity and specificity in predicting the development of COVID-19. While further validation is necessary, this non-invasive and passive modality may be helpful to monitor large numbers of people for possible infection with SARS-CoV-2 and help direct testing toward high-risk individuals.

The authors declare no relevant conflicts of interest.
(A) Subjects wore smartwatches that collect measurements of HRV and RHR. Subjects answered daily surveys to provide health outcomes, including COVID test results. (B) Each day, each subject's observation is labelled COVID+ if it was made within ±7 days of the patient's first positive COVID-19 test; otherwise the observation is labelled COVID-. (C) HRV measurements were too sparse to estimate HRV COSINOR parameters (MESOR, Amplitude, Acrophase) for each day; thus, we estimated smoothed parameters using a 7-day sliding window. RHR (mean, sd, min, max) was also estimated over this window. (D) The data were split into 100 training and testing sets; models were fit to the training data, and performance was estimated using 10-fold CV.
10-CV predictions were used to define a decision rule that increases sensitivity; this decision rule was applied to the predictions in the testing data to obtain the final performance.
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
Temporal dynamics in viral shedding and transmissibility of COVID-19
A comprehensive SARS-CoV-2 genomic analysis identifies potential targets for drug repurposing
Wearable Devices Are Well Accepted by Patients in the Study and Management of Inflammatory Bowel Disease: A Survey Study
Digital Health: Tracking Physiomes and Activity Using Wearable Biosensors Reveals Useful Health-Related Information
Wearable sensor data and self-reported symptoms for COVID-19 detection
Pre-symptomatic detection of COVID-19 from smartwatch data
Use of Physiological Data From a Wearable Device to Identify SARS-CoV-2 Infection and Symptoms and Predict COVID-19 Diagnosis: Observational Study
Heart rate variability. Standards of measurement, physiological interpretation, and clinical use. Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology
SMOTE: Synthetic Minority Oversampling Technique
Greedy function approximation: A gradient boosting machine
pROC: an open-source package for R and S+ to analyze and compare ROC curves
Building Predictive Models in R Using the caret Package
Fever and cardiac rhythm
The Importance of Respiratory Rate Monitoring: From Healthcare to Sport and Exercise. Sensors (Basel)
Sample asymmetry analysis of heart rate characteristics with application to neonatal sepsis and systemic inflammatory response syndrome
Continuous multi-parameter heart rate variability analysis heralds onset of sepsis in adults
Wearable devices for the detection of COVID-19
Harnessing wearable device data to improve state-level real-time surveillance of influenza-like illness in the USA: a population-based study
Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the New York City Area