key: cord-0698794-7fczfr99
authors: Nestor, B.; Hunter, J.; Kainkaryam, R.; Drysdale, E.; Inglis, J. B.; Shapiro, A.; Nagaraj, S.; Ghassemi, M.; Foschini, L.; Goldenberg, A.
title: Dear Watch, Should I Get a COVID-19 Test? Designing deployable machine learning for wearables
date: 2021-05-17
journal: nan
DOI: 10.1101/2021.05.11.21257052
sha: 23bc5879b6c4a9fad5b8061051d5f87463311280
doc_id: 698794
cord_uid: 7fczfr99

Commercial wearable devices are surfacing as an appealing mechanism to detect COVID-19 and potentially other public health threats, due to their widespread use. To assess the validity of wearable devices as population health screening tools, it is essential to evaluate predictive methodologies based on wearable devices by mimicking their real-world deployment. Several points must be addressed to transition from statistically significant differences between infected and uninfected cohorts to COVID-19 inferences on individuals. We demonstrate the strengths and shortcomings of existing approaches on a cohort of 32,198 individuals who experienced influenza-like illness (ILI), 204 of whom reported testing positive for COVID-19. We show that, despite commonly made design mistakes that result in overestimated performance, properly designed wearable-based models can be used effectively as part of the detection pipeline. For example, knowing the week of the year, combined with naive randomised test set generation, leads to a substantial overestimate of COVID-19 classification performance, at 0.73 AUROC. However, an average AUROC of only 0.55 ± 0.02 would be attainable in a simulation of real-world deployment, where the model is used to trigger further testing, due to the shifting prevalence of COVID-19 and non-COVID-19 ILI. In this work we show how to train a machine learning model to differentiate ILI days from healthy days, followed by a survey to differentiate COVID-19 from influenza and unspecified ILI based on symptoms. For a forthcoming week, such models can expect a sensitivity of 0.50 (0-0.74, 95% CI), while using the wearable device to reduce the burden of surveys by 35%. The corresponding false positive rate is 0.22 (0.02-0.47, 95% CI). In the future, serious consideration must be given to the design, evaluation, and reporting of wearable device interventions if they are to be relied upon as part of frequent COVID-19 or other public health threat testing infrastructures.

The use of wearable devices has skyrocketed in recent years: about 3 in 10 people in the USA wear a fitness or health tracking sensor [62]. The abundance of physiological data has led to the development of algorithms that deliver healthcare at the edge, with regulatory approval in applications such as atrial fibrillation detection [45]. In the wake of the global COVID-19 pandemic, there has been an unprecedented need for early detection of COVID-19 in a way that differentiates it from seasonal influenza or otherwise unspecified influenza-like illness (ILI), given the substantial overlap in signs and symptoms. Some jurisdictions have curbed transmission through extensive testing and proper infection control: physical distancing, isolation of positive cases, and contact tracing [26; 48]. In contrast, reduced capacity for testing has been identified as a rate-limiting step in the control of the pandemic [25]. Wearable devices can contribute to testing infrastructure and, in turn, reduce transmission.
Studies conducted prior to and throughout the pandemic demonstrate that the physical manifestations of ILI, and in particular COVID-19, can be captured using wearable devices [35; 38; 43; 51]. To understand the usefulness of wearable devices for public health infrastructure, proposed models must be evaluated in a scenario that resembles the conditions of practical deployment. It is essential to exhaustively evaluate the potential impacts of models based on wearable data to avoid unattainable expectations or under-utilisation.

RESULTS

We collected a dataset, referred to as FLUCOVID, which contains survey and wearable device data collected between November 1, 2019, and June 1, 2020. There are 32,198 participants with both surveys and wearable data in this cohort; 2,554 had medically diagnosed influenza, 204 had COVID-19, and the remainder had unspecified ILI. A complete description of the data collection process can be found in [53]. Our survey data proportionally matched peaks and trends in CDC data for influenza [8] and COVID-19 incidences [16] (see Figure 1). Using self-reported symptom onset and recovery dates, combined with reports of medically diagnosed COVID-19, we define the task as a daily classification of whether a date is within the self-reported ILI symptom dates (label of 1) or outside of that region during presumably healthy days (label of 0). Under this paradigm, day 0 is the first day on which a participant experiences symptoms, even though the participant acquires the virus, and is capable of transmitting it, prior to symptom onset (negatively labelled days in the range of -14 to -1). We investigated several preprocessing steps for the input data and the outcome. We investigated normalising individuals' covariates by their own history, as was previously done in the literature [51]. However, we found that this step did not outperform simple normalisation of the individual's covariates by the training set population means. We thus normalise our data by the training set population means. This step allows us to make predictions immediately after a user enrols rather than waiting to accumulate sufficient data. In the spirit of using the most data available, we did not eliminate users with vast amounts of missing data due to lack of wear time or missing channels. This deviates from the convention accepted in previous works [35; 38; 43; 50] but allows us to make predictions even in difficult edge cases. To address the increased amount of missing data due to the broader inclusion criteria, we mean-imputed missing data at the beginning of the study to left-align the participants, and forward-filled data after an individual's steps, heart rate, or sleep time had already been observed. Note that the XGBoost [12] and GRU-D [11] models are capable of handling missing data. See further details on input data pre-processing in the Online Methods and the Appendix. For the survey response covariates, survey responses are given once per week, recounting symptoms experienced on each day. These responses are disambiguated into binary daily symptom flags for each of the reported symptoms or behaviours. Unfortunately, some participants respond to surveys weeks after their ILI event to provide onset and recovery dates. These individuals are retained during all training and testing procedures. Wherever no information is reported, it is assumed that participants are not experiencing any symptoms and are following all COVID-19 precautions.
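As a concrete illustration of the input preparation described above, the sketch below normalises each wearable channel by training-set population statistics, keeps an observation mask, and imputes the start of each record with a population value before forward-filling. The DataFrame layout, column names, and the choice to centre and scale by training statistics are assumptions for illustration, not the study's exact implementation.

import pandas as pd

def preprocess_wearables(df: pd.DataFrame, feature_cols, train_ids):
    """Population-level normalisation plus simple imputation (illustrative sketch).

    Assumes `df` has a MultiIndex of (participant_id, date) with one row per
    participant-day; the schema and the centre-and-scale choice are assumptions.
    """
    is_train = df.index.get_level_values("participant_id").isin(train_ids)
    mu = df.loc[is_train, feature_cols].mean()
    sd = df.loc[is_train, feature_cols].std()

    out = df.copy()
    out[feature_cols] = (out[feature_cols] - mu) / sd  # training-population statistics only
    # Keep an observation mask; XGBoost and GRU-D can consume missingness directly.
    for c in feature_cols:
        out[f"{c}_observed"] = df[c].notna().astype(int)
    # Mean-impute the start of each record (0 after centring), then forward-fill
    # once steps, heart rate, or sleep have been observed for that participant.
    out[feature_cols] = (
        out.groupby(level="participant_id")[feature_cols].ffill().fillna(0.0)
    )
    return out

Because the observation mask is retained alongside the filled values, models that handle missingness natively can down-weight imputed days rather than treating them as real measurements.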
The day-level ILI labels are constructed by labelling all days between self-reported symptom onset and self-reported recovery as positive. This differs from previous work, where a fixed number of days around symptom onset are labelled as positive [35; 43; 50]. We did this in order to make predictions continuously on any date prior to, or after, symptom onset, to better simulate deployment. It is important to remember that our detector is not a wearable COVID-19-specific detector; it is a detector of ILI events, which are a superset of medically undiagnosed ILI, self-reported medically diagnosed flu, and self-reported medically diagnosed COVID-19. Finally, approximately a third of users had multiple reported events. When these events are separated by fewer than 7 days, they are merged into a single event, with the COVID-19 label superseding the unspecified ILI label. All presented wearable models are trained using 48 variables representing sleep, heart rate, and steps measurements taken from wearable devices, aggregated on a daily basis. Throughout this work we use the following definitions to distinguish the training and evaluation schemes compared in this paper (a sketch of the prospective scheme appears below):

1. Random splits of data refers to a random shuffling of historic data to create train, validation, and test event splits. We also refer to this as retrospective evaluation, since the training events and testing events all occur in the past. In such a scenario, the prevalence of COVID-19 and ILI is likely to be the same across train and test, a conventional but not a realistic deployment scenario.

2. Prospective evaluation is the setting where test data is chronologically after all of the training and validation data. We know that over a single season of data, influenza cases surge and fall, and with emergent COVID-19 infections, cases are anything but stationary. In real life, the prevalence in the week after a COVID-19 wearable model is deployed is unlikely to match the prevalence experienced in the training set. Inevitably, a deployment-ready model would have to face this modelling challenge, so it makes sense to test in this manner to avoid providing the public with poorly calibrated COVID-19 or ILI predictions.

Figure 1. Influenza and COVID-19 incidence: Publicly available data for outpatient influenza from the CDC ILINet and COVIDcast case counts are compared to our incidence of survey ILI and COVID-19 cases. The COVID-19 data is scaled as denoted in the legend. After COVID-19 becomes more prevalent than ILI, as denoted by the vertical line, the ILI survey cases show an unusually high proportion of unspecified ILI compared to the CDC outpatient counts. Some unspecified ILI after March 16 may be attributed to undiagnosed COVID-19.
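The sketch below shows one way to generate the prospective splits defined in item 2: for each test week, everything that precedes the validation window is training data and the week immediately before the test week is the prospective validation set. The weekly cadence, the pandas layout, and the column of dates are illustrative assumptions rather than the study's exact code.

import pandas as pd

def sliding_prospective_splits(dates: pd.Series, first_test_week: str, last_test_week: str):
    """Yield boolean (train, validation, test) masks for deployment-style evaluation.

    The training set grows each week, while the validation and test windows slide
    forward; none of the test week is ever visible during training.
    """
    weeks = pd.to_datetime(dates).dt.to_period("W")
    for test_week in pd.period_range(first_test_week, last_test_week, freq="W"):
        val_week = test_week - 1
        train_mask = weeks < val_week
        val_mask = weeks == val_week
        test_mask = weeks == test_week
        yield train_mask, val_mask, test_mask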
Aggregate metrics from randomly sampled cohorts artificially inflate performance

In the instance of seasonal influenza, models are able to use trends over time to estimate current trends [51]. With emergent diseases like COVID-19, there is no past-season data to rely on to create such estimates. We trained an XGBoost [12] model on 5 random splits of data (5 repetitions of 35% training, 7.5% validation, 7.5% held-out retrospective test set, and 50% held-out prospective test set, split by participant) to detect ILI events. When analysing the performance on randomised test splits, we found that most of the AUROC metric was driven by between-week comparisons, giving an inflated estimate of performance. For example, knowledge that the first week of February had a high prevalence of influenza would lead the model to make more calls in the first week of February in the test set. Dissecting the test set predictions into their constituent weeks lowered the AUROC for nearly all weeks (see Figure 2). Removing the week-of-year feature helped mitigate the overestimate for the XGBoost model, but there was still a drop from the overall AUROC to the weekly AUROC on the randomly drawn test set (Figure 2). These results indicate that the model had learned to predict the prevalence of ILI in each week, rather than connecting the underlying physiological data to the outcome. Note that in each instance, the machine learning model outperforms a non-machine-learning baseline that was designed for population-level influenza surveillance [51].

For emergent diseases, one would imagine training a model on all available data, selecting parameters, then deploying the model. The new cases that this model predicts on will be temporally disjoint from the training data. In the following week, the process would be repeated by training a freshly initialised model with an additional week of data. Naturally, this creates a situation where there is a continuously growing training set, a sliding validation set window, and a sliding test set window (see Figure S1).

Figure 2. XGBoost trained on randomly split ILI data: Conditioning the performance by time (solid lines) reduces performance compared to when performance is assessed irrespective of time (dashed lines) for randomly drawn training, validation, and testing sets. All machine learning models outperform the non-machine-learning elevated RHR baseline.

This deployment-style evaluation is in contrast to the conventional ML random-split dataset scenario used in the previous section and in related work [43], since none of the future observations are seen during training. Additionally, our prospective evaluation only contains participants who are distinct from the participants in the training data for that week's model on the FLUCOVID data. Each consecutive week features a new 50% sample of that week's participants (sampled to obtain uncertainty estimates). With each new model trained on the additional week of data, we also reserve a 7.5% random split of the data that is concurrent with the training set, since by the time of deployment the participants in this set have already been encountered. This allows us to evaluate our models in a standard random-split setting, as one would in a typical ML workflow that is blind to temporal patterns in the data. We find that the perceived performance from the random split overestimates the prospective evaluation in nearly every week for both models (see Figure 3). Random test set evaluation overestimates our expected performance for ILI detection using wearable devices. These findings exemplify how real the danger of over-optimistic result estimation is when the test setting is not representative of the deployment scenario.
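A minimal sketch of the week-conditioned comparison above: compute a pooled AUROC over the whole test set and contrast it with AUROCs computed within each calendar week, so that between-week prevalence differences cannot contribute to the ranking. Column names are illustrative.

import pandas as pd
from sklearn.metrics import roc_auc_score

def pooled_vs_weekly_auroc(df: pd.DataFrame) -> pd.Series:
    """Contrast pooled AUROC with AUROC conditioned on calendar week.

    `df` needs columns date, y_true, y_score (illustrative names). Pooling lets
    between-week prevalence differences inflate the score; per-week AUROCs only
    credit within-week separation.
    """
    pooled = roc_auc_score(df["y_true"], df["y_score"])
    weeks = pd.to_datetime(df["date"]).dt.to_period("W")
    weekly = df.groupby(weeks).apply(
        lambda g: roc_auc_score(g["y_true"], g["y_score"])
        if g["y_true"].nunique() == 2 else float("nan")
    )
    return pd.Series({"pooled_auroc": pooled, "mean_weekly_auroc": weekly.mean()})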
Given that the incidences of ILI and COVID-19 are non-stationary throughout the year, we attribute the drop in performance to an over-reliance on memorising the COVID-19 prevalence in a given week, as before. This attribution is supported by the observation that removing the week-of-year feature produces a corresponding drop in performance on retrospective test sets for XGBoost, which is not observed on prospective test sets (see Figure 3). To identify the cause of the dataset shift, we applied PCA and t-SNE visualisations to the input data. When the week-of-year feature is absent, the data do not show coherent clustering when falsely coloured by time; this temporal feature is the only covariate with a large contribution to the temporal variance (see Figure S2). We observe that the means of features change throughout the study for their respective classes (see Figure S3). However, covariate shift is not the main source of the performance gap; the dominant effect is the shift in outcome prevalence, known as prior probability shift [40]. We address this in model development by selecting thresholds on a prospective validation set, instead of using a randomly selected validation set. We found this to be essential to attain reasonable performance on future data during the ebbing prevalence of COVID-19 with a single season of data (see Figure S8).

For any prospective week, a model is trained to differentiate symptomatic COVID-19 days from symptomatic ILI days or healthy days prior to illness. When models are evaluated on prospective data, we find that the margins of separation between the COVID-19 class and the non-COVID-19 ILI and healthy classes are small. These weekly results are reported in Figure 4a for XGBoost. By nature of the FLUCOVID study design, only individuals who eventually experience an ILI are included. For any given week, using wearable devices to directly detect COVID-19 achieves a sensitivity of 0.52 (0-1, 95% CI), with a false positive rate of 0.49 (0.04-0.99, 95% CI). The test set sensitivity will not change as more negative controls are added to the model, though all other metrics (specificity, positive predictive value, negative predictive value, AUROC, and AUPR) are affected by the negative class proportion. The reported scores are conservative, as we have a disjoint set of participants and non-intersecting date ranges between the training, validation, and test sets. In reality, only a small proportion of participants would be expected to enter or leave the survey each week, meaning deployment performance would only be harmed by a lack of generalisation through time, rather than a lack of generalisation through time and across participants. Ultimately, it could be easier to differentiate between COVID-19, ILI, and healthy days using daily surveys [50]. However, long-term engagement in digital health studies is a burden to the model's beneficiaries [47]. Survey models currently upper-bound mobile COVID-19 diagnostics, though surveys will still fail to detect pre-symptomatic, asymptomatic, and possibly mildly symptomatic COVID-19.
Figure 4. Wearable vs. survey model performance: Performance of two data modalities, wearable data or survey data, for directly predicting COVID-19 each week. Results are shown as mean ± standard deviation.

Recall that for COVID-19, 17% of cases are estimated to be asymptomatic [5; 39], which is comparable to the rate of asymptomatic influenza [32]. Both asymptomatic COVID-19 [30] and asymptomatic influenza [21] shed viral RNA. For those who eventually develop symptoms, the relatively long incubation time of SARS-COV-2 is problematic, as 44% of transmission occurs during the pre-symptomatic phase [17], though active viral load peaks around symptom onset [9]. These sources of transmission cannot be prevented with symptom-based screening alone. Our data is collected such that a participant would not be asked to complete a survey if they did not have symptoms. The task for this survey model is to detect which symptoms are a result of COVID-19, and which can be presumed to be non-COVID-19 ILI. Models are trained on the daily symptom history and demographic covariates to distinguish COVID-19-positive individuals from other ILI. The top-performing model, a GRU, achieves this task with an AUROC of 0.78 ± 0.19 and an AUPR of 0.28 ± 0.11 for a prospective week, on a disjoint set of participants. The sensitivity and specificity of this method are shown in Figure 4b. In comparison, the linear model on a curated set of features, as reported in [34; 50], yielded an AUROC of 0.62 and an AUPR of 0.03 on our entire FLUCOVID survey data. Since the nature of the task is to determine whether symptoms are caused by COVID-19 or not, and since all ILI cases are included in the study by design, the reported metrics would remain valid even if additional healthy individuals were included. For any sensitivity chosen on the wearable model, we always improve the likelihood that the individual has COVID-19 (see Figure S10). Hence, for a permissive sensitivity, we can dismiss many healthy individuals, which minimises reliance on continued survey engagement. The performance of the XGBoost and GRU-D wearable models, paired with a GRU survey model, is shown in Figures 5a & 5b. For an average week, the performance on a disjoint set of participants can be expected to be 0.50 (0-0.74, 95% CI) sensitivity and 0.79 (0.53-0.98, 95% CI) specificity for the XGBoost wearable model combined with the GRU survey model. However, given a positive prediction at any date near or during the onset of infection, participants could have been directed to get further confirmatory testing. Any positive prediction leading up to symptom onset could elicit behavioural change, such as staying home from work or school, or getting a PCR test. Under this assumption, false negative results that follow positive predictions do not have such a profound impact. We therefore consider a cumulative score that automatically classifies a day as positive if any of the past 7 days has been classified as positive. We hypothesise that this would increase the sensitivity (by decreasing the number of false negatives) while decreasing the specificity. This nuance in reporting leads to an expected weekly sensitivity of 0.65 (0.19-0.87, 95% CI) and specificity of 0.69 (0.41-0.97, 95% CI) on prospective test sets on FLUCOVID data.
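A minimal sketch of the 7-day cumulative rule just described: a participant-day is flagged positive if any of the preceding window of daily calls was positive. The column names and pandas layout are illustrative.

import pandas as pd

def cumulative_positive(daily_calls: pd.DataFrame, window: int = 7) -> pd.Series:
    """Flag a participant-day positive if any of the past `window` daily calls was positive.

    `daily_calls` has columns participant_id, date, y_hat with y_hat in {0, 1}
    (illustrative names). This trades specificity for sensitivity, on the premise
    that a positive call near onset already triggers confirmatory testing.
    """
    ordered = daily_calls.sort_values(["participant_id", "date"])
    rolled = (
        ordered.groupby("participant_id")["y_hat"]
        .rolling(window, min_periods=1)
        .max()
        .reset_index(level=0, drop=True)
    )
    return rolled.astype(int).reindex(daily_calls.index)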
Roughly 63.5% of COVID-19 cases are detected at symptom onset (47.7% for non-COVID-19 ILI), and 68.9% of cases are captured by day 3 of symptoms (55.8% for non-COVID-19 ILI), before many individuals would be compelled to take a PCR test and receive results (see Figure S11). The wearable component of this prediction scheme prompts surveys 65% of the time (a 35% reduction in survey burden compared to daily symptom reporting). The sensitivity of the wearable-only model (any ILI detection) ...

(a) XGBoost model tested on 50% of the FLUCOVID dataset. The performance of the wearable portion of the combined model is considerably better than that of the wearable models for direct COVID-19 prediction, which had to distinguish COVID-19 from ILI, since the combined approach does not have to differentiate COVID-19 from ILI, and the ILI labels are more frequent in our dataset.

The pandemic has a disparate impact on minorities. One meta-analysis found Black and Asian groups were more likely to contract COVID-19 than their White counterparts [58]. These results are in agreement with several other studies for Black [23; 42; 49], Hispanic, Asian, and North American Indigenous minorities [22]. We examine the performance of the combined wearable and survey model, subset by race and by gender, for subgroups with more than 1000 members. These results are shown in Table I. Results show that the differences between male and female subpopulations are significant for both specificity and sensitivity, but there is a tradeoff: for the same decision threshold, one group experiences higher sensitivity and lower specificity, whereas the other group experiences the opposite effect. This pattern is also observed in the Black participants, where the model sensitivity on Black participants is 12% lower than the overall sensitivity, but the specificity is 10% higher with the same threshold. In models where the week-of-year is omitted, the sensitivity difference is reduced. Our results also show that the model's sensitivity is significantly higher on the Asian subpopulation compared to the non-Asian population. The sheer number of non-COVID-19 days renders all of the examined subpopulation specificities significant when compared to the rest of the population. The Asian, Black, and Hispanic minority subpopulations have specificities exceeding the population mean, whereas the White subpopulation specificity is below that of other groups. Overall, the model performance does not exacerbate existing inequalities of the COVID-19 pandemic noted in previous studies [23; 42; 49; 58]. However, in specific use contexts it may be essential to leverage post-processing techniques to achieve demographic parity or equalised odds.
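Table I's significance tests compare subgroup sensitivities and specificities as differences in binomial proportions. A minimal sketch of that Wald-style z-statistic is below; the unpooled standard error is one common formulation and is an assumption about the exact form used, as is the function name.

import math

def two_proportion_z(k1: int, n1: int, k2: int, n2: int) -> float:
    """Wald-style z-statistic for the difference of two binomial proportions,
    e.g. a subgroup's sensitivity (k1 of n1 positives detected) versus the
    complementary subgroup (k2 of n2). Unpooled standard error is assumed.
    """
    p1, p2 = k1 / n1, k2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se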
Importance of including imperfect data

All of the 32,198 participants in the FLUCOVID study had at least one ILI event and consented to the use of their wearable and survey data. Participants have varying degrees of device wear time: some may only take the device off to charge it (nearing 100% wear time when acquiring signals for an entire day), whereas others have sporadic usage. Previous studies require a minimum device wear time for their cohort selection [35; 43], and all participants who do not meet this requirement are dropped. We instead choose models that can implicitly handle missingness. When we gradually eliminate patient-days based on their percentage of missingness, we find that the separation between ILI and healthy classes diminishes (see Figure S12). Intuitively, this suggests that including missingness makes the task more difficult.

Table I. Combined wearable and survey model performance for COVID-19 detection by race or by gender for cohorts with more than 1000 participants. Results are reported as score (95% confidence interval). Z-score calculated as the difference in binomial proportions (Wald test) relative to the complementary group (e.g. Asian vs. non-Asian). This includes week-of-year, and null survey answers are assumed to be healthy predictions.

By the nature of the task, abstaining from making a prediction is synonymous with classifying a participant as healthy. In order to compare wearable device model performance to other diagnostic measures, such as PCR tests, missing-data cases cannot be neglected in test data. We replace these missing predictions with a classification of healthy when evaluating models on test data. It is important to note that all labels in the dataset have been created using self-reported symptoms, onset dates, and recovery dates. When the time between the self-reported date of disease recovery and the survey response is more than one week, the model's area under the precision-recall curve (AUPR) drops considerably, corresponding to the reduction in positive cases; however, the sensitivity and specificity are not overly perturbed by delays in reporting (see Figure S7). The deteriorating AUPR is driven by a decreasing positivity rate.
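The evaluation convention just described, in which an abstention is scored as a healthy call, can be sketched as follows. The NaN encoding of abstentions and the function name are illustrative assumptions.

import numpy as np

def score_with_abstentions(y_true: np.ndarray, y_score: np.ndarray, threshold: float):
    """Evaluate with abstentions counted as healthy calls.

    `y_score` holds NaN on participant-days where no prediction could be made
    (e.g. no wear time); those days are scored as negative rather than dropped,
    so the comparison to always-resulting tests such as PCR stays fair.
    """
    y_pred = np.where(np.isnan(y_score), 0, (y_score >= threshold).astype(int))
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity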
Reporting performance across the entire study

Our reporting differs from the existing literature in that we report across the entire timeline, whereas other approaches [35; 38; 43; 50] report classification results for windows around onset. A machine learning model trained to detect ILI would be expected to have a detectable change in outputs, corresponding to an increase in detections around the time of ILI symptom onset. It is tempting to validate the machine learning model by aligning participants in the test set by the date of onset to confirm this. Previous research has demonstrated that model performance will be artificially inflated unless an outcome-independent reference point is used [54]. Prior wearable work calculates the AUROC, and plots the ROC, after having discarded a week of data before the reported date of onset, and excludes dates more than 7 days after onset and earlier than 21 days prior to onset [43; 50]. However, even a small false positive rate can completely neutralise the perceived utility of the model for an imbalanced task such as determining ILI days vs. healthy days. For posterity, we also aggregate our results in this fashion in Figure 6b. A full comparison of our model performance to published models, under their own evaluation strategies, can be found in Table S7, though we do not encourage communicating model performance according to these metrics, as they do not reflect model utilisation.

A similar design was followed in a study where nighttime respiratory rate, RHR, and heart rate variability (HRV) were used to train a model to classify the onset of COVID-19 [35]. To accommodate the class imbalance between sick and healthy days in the training set, sick days were synthetically oversampled by adding noise. The model captured 80% of the COVID-19-positive individuals, but the same model detected 34% of individuals who exhibited symptoms and tested negative for COVID-19. This is understandable, as the model was not specifically trained to discern between COVID-19-positive and -negative cases. However, this evaluation should be viewed cautiously. The numbers of sick days and healthy days in the objective are arbitrarily chosen; only days within the -30 to -14 range were used in the validation sets as negative labels (for COVID-19, 99% of symptomatic cases will show symptoms within 14 days after exposure [2; 29]). For those who eventually got COVID-19 the FPR was 4.7%, and those who eventually had ILI had an FPR of 5%. An FPR of 4.7% corresponds to a COVID-19 prediction once every 21 days, or 17 false positive calls per year for a healthy individual [35].

In another study, an online outlier detection method was shown to correlate with a post-hoc/offline outlier detection method on heart rate, heart-rate-to-steps ratio, and sleep features to detect the onset of symptoms for 32 SARS-COV-2-positive individuals [38]. Specifically, their method correctly identified an outlier region within 14 days of SARS-COV-2 symptom onset for the majority of users. When we applied this outlier detection method to a subset of our FLUCOVID cohort, we were also able to find a detection event within 14 days of symptom onset for most individuals. However, the sensitivity of this approach was strongly influenced by the amount of data recorded before and after the SARS-COV-2 symptom onset date. For example, when we (artificially) decreased the amount of data in 14-day increments in our study, the symmetric sensitivity (a detection event within 14 days before/after onset) rose from 63% to 100%, and the one-sided sensitivity (a detection event within 14 days before onset) rose from 40% to 61%. Reducing the amount of recorded data around onset increased the positive predictive rate from 27% to 30%, and lowered the p-value calculated relative to a null detection proportion from 3% to <1%. Constraining the detection window over which any ML method is evaluated will artificially increase sensitivity, specificity, and statistical significance. Outlier detection results are unlikely to generalise to small time frames. For COVID-19, the median time from exposure to onset is 4-6 days [2; 29; 65], and 97.5% of symptomatic cases occur within 11.5 days after exposure [29]. If the putative exposure is more than 14 days prior to symptom onset, a detection is unlikely to be a true positive. Influenza has an even shorter median incubation time, with only 1.4 or 0.6 days until influenza A and influenza B symptom onset, respectively [31]. Grouping the predictions by proximity to ILI cases can hide real trends in the data. Calculating the AUROC around onset is highly influenced by the class balance [46], which in this setting is contrived. Prior commentary is available on the attribution of important baselines, such as prevalence [13].
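Two quick checks on the prevalence-dependent quantities discussed above, assuming one prediction per day: the false-alarm burden implied by a constant daily false positive rate, and the standard Bayes relation showing how positive predictive value collapses as the positive-day fraction shrinks. The daily FPR reuses the 4.7% quoted for the WHOOP study [35]; the operating point in the loop is purely illustrative.

daily_fpr = 0.047
print(1 / daily_fpr)    # ~21 days between false alarms for a healthy user
print(365 * daily_fpr)  # ~17 false positive calls per year


def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value for a given positive-day fraction (Bayes relation)."""
    tp = sensitivity * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp)


# The same operating point looks very different as the positive-day fraction shrinks.
for prevalence in (0.5, 0.1, 0.01):
    print(prevalence, round(ppv(0.80, 0.95, prevalence), 3))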
The likelihood of contracting COVID-19 is non-stationary, because the transmissibility of COVID-19 changes in response to public health measures, the number of infected cases, and the circulating strains [1]. Evaluation procedures must be robust to these effects. This type of dataset shift is known as prior probability shift [40], which warrants careful calibration of machine learning models or additional training techniques [52]. In this study we have focused on evaluating our models and communicating model performance in a way that is reflective of deployment under these circumstances.

Account for ILI cases when building a COVID-19 detector

It is essential that studies account for ILI cases when building COVID-19 wearable detectors [53]. When applying machine learning models to aid in the detection of emergent diseases, we must use training data which are representative of data in a deployment scenario. An individual with influenza may share several characteristics of COVID-19, despite never being exposed to the SARS-COV-2 virus. There is an overlap between COVID-19 and influenza in both symptoms and autonomic responses which cannot be ignored when reporting the specificity of COVID-19 diagnostic tests. Fortunately, these overlapping symptoms present with different frequencies. For example, COVID-19 patients are significantly more likely to have ageusia (loss of taste), fever, anosmia (loss of smell), myalgia/arthralgia (aching body), diarrhea, and nausea [7; 34; 63]. The symptoms which are more prevalent in COVID-19-negative ILI patients are cough and sore throat, though only sore throat is statistically more likely to be associated with influenza [63]. The associations of these symptoms with COVID-19 have been corroborated in other studies [10; 43]. Symptom-based COVID-19 detection is somewhat able to distinguish COVID-19-positive patients from other respiratory illnesses using a simple logistic regression model [6]. However, when determining the chance of a positive PCR COVID-19 test given that a patient presents symptoms, the major drivers of the prediction in an early stage of an outbreak were travel history and case proximity, not fever or cough [10]. These results will vary based on the richness of the feature set, the accuracy of symptom recording, the size of the cohort, and the progression of the pandemic in the region. Using wearable device data in place of symptom-based screening exchanges improved frequency of monitoring for further abstraction from the non-specific symptom features.

COVID-19, mild symptomatic COVID-19, and ILI differentiation

With COVID-19, mild symptomatic COVID-19, and ILI all concurrently present, it is essential that models can differentiate between them. Differentiating between COVID-19 days and healthy days tends to be the easiest task amongst the subclasses of COVID-19, influenza, and unspecified ILI, alongside the healthy class (Figure S9). It is considerably more difficult to differentiate between unspecified ILI and healthy cases. Unspecified ILI cases in the FLUCOVID surveys subside at a proportionately slower rate than the CDC's reported values [8] (see Figure 1). We suspect that these ILI cases may be undiagnosed, or misdiagnosed, "mild symptomatic COVID-19". Amongst the 8743 participants who reported ILI symptoms and did not test positive for SARS-COV-2 between April 1, 2020 and June 1, 2020, 86% received a negative COVID-19 test between -2 and 21 days after symptom onset (14 days post symptoms + 7 days reporting delay).
Only 6% of patients who did not have COVID-19 tested positive for influenza in the FLUCOVID dataset; several of these were participants who had not been tested for COVID-19. Fewer than 1% of participants in this time period experienced symptoms, did not get tested for influenza or COVID-19, had not previously tested positive for COVID-19, yet reported having close contact with a COVID-19 case within the 14 days prior to symptom onset. This small proportion of untested, COVID-19-contact, unspecified ILI cases equates to roughly 10% of the confirmed COVID-19 cases during the same period. If any of these untested cases are actually COVID-19, it could decrease the specificity in our reported results, as the true positives would be disguised as false positives. This coincides with evidence of individuals in the USA who had been infected with SARS-COV-2, based on the presence of antibodies, yet had not been reported [3] (though these results must be interpreted cautiously [56]). These individuals would still be capable of transmitting the virus; however, had we followed the convention of previous works, unspecified ILI patients would have been excluded from the dataset. The signal learned under that convention would be the joint probability of getting a PCR test and that test being positive for SARS-COV-2 [15]. We want to discern the latter, unconditioned on getting a PCR test.

Figure 6. Positive calls around symptom onset: XGBoost model performance for detecting any ILI or COVID-19, (a) evaluated on a retrospective test set and (b) evaluated on a prospective test set, grouped relative to the symptom start date (day 0). Note that an elevated percentage predicted after day 0 indicates better performance under our labelling scheme.

Limitations

There are several limitations of this study which would affect deployment-level performance. Most importantly, our training dataset does not contain participants who are healthy throughout the entire duration of the study. Upon deployment, the PPV, NPV, specificity, AUROC, and AUPR will all be affected, since these metrics involve the calls made on COVID-19-negative participants. In addition, the participants consenting to these studies may not represent the population for which this would be deployed: both cohorts are composed of primarily female and primarily white demographics. This could be attributed to the selection bias of those who wish to participate, or the selection bias of individuals who own a wearable device. We find that symptom-based screening substantially outperforms wearable device screening in differentiating between COVID-19 cases and non-COVID-19 ILI cases. The limitation of symptom screening is that it negates the ability of wearable devices to identify positive COVID-19 cases prior to symptom onset. Both models could be improved with enriched features; for example, changes in heart rate variability metrics have been shown to be associated with COVID-19 [19; 35; 43]. Our features are summarised at the day level. More granular time intervals could improve identification of ILI signatures.
Improvements could be made to the modelling, such as using side information from the survey data to improve the representational capacity of the wearable-only model, or requiring a one-time demographic survey so that wearable devices can leverage participant attributes to improve performance. A further reduction in the reliance on surveys could be achieved through objectives that learn to defer to surveys when the wearable model is uncertain [41]. Specific modelling techniques could be applied to address dataset shift due to frequently changing lockdown status or changing prevalence [52]. The prospective results in this study report performance on a disjoint test set, which is likely to overestimate the performance gap caused by generalisation across time, since it also encompasses generalisation across participants. Users could potentially benefit from having personalised models to capture outlying personal patterns [64], though global patterns would need to be leveraged for COVID-19 detection, since it does not occur more than once per individual in our datasets. Finally, changes in how these predictions are used could impact their performance. Perhaps tests or treatments are scarce and only allocated to the top-k individuals. In another use case, not providing a prediction when there is insufficient data would improve the performance of the model. This could be useful in a context where a negative prediction is required to attend class in person (whereas positive and null predictions would dictate remote attendance).

We collected Fitbit activity data, demographic baselines, and weekly symptom surveys from 33,777 participants from November 2019 through August 2020. All users reside in the United States. We refer to this dataset as FLUCOVID. The characteristics of this dataset are described in greater detail in the following sections. For details on the collection of data, we refer the reader to our previous work [53]. Participants consented to share data for the study, which includes age, education, ethnicity, children and relationship status, height, weight, and location. In addition to these demographic variables, comorbidities are available, including asthma and COPD (amongst others). Demographic data and comorbidities are not used in training wearable models, because they are not available in the wearable device data. This allows us to cast the widest net in COVID-19 detection (sensitivity) at the expense of specificity. The characteristics of the data can be found in Table II. Our wearable features contain 48 covariates derived from heart rate monitoring, step tracking, and sleep tracking features for each day (Table S1). We compare several data preparations, including regularly vs. irregularly indexed data [20], population normalisation vs. individual normalisation [51], and forward-filling imputation vs. no imputation. A complete description of our data preparation is available in Supplementary A. When the effect is insignificant, or the effect size between preparation tasks is small, we err on the side of including as many patient-days as possible.
Unless otherwise specified, we use regular indexing, population-level normalisation, and forward-filling imputation when reporting our results. Each participant in the FLUCOVID cohort experienced at least one influenza-like illness (ILI) event between November 1, 2019, and June 1, 2020. Of the 32,198 participants with both surveys and wearable data in this cohort, 2,554 had medically diagnosed influenza, 204 had COVID-19, and the remainder had unspecified ILI. 27% of users reported more than one ILI event in the study. All of our labels in the survey are derived from self-reported symptoms and test results. Each week, participants are asked to recall their symptoms over the past 7 days. They note on which days they experienced flu-related symptoms, COVID-19-related symptoms, and complications with chronic diseases. ILI onset and recovery dates are also reported by the participant. In addition to illness, there are questions pertaining to self-isolation, air travel, and household members with ILI or COVID-19.

We opt for machine learning models that can implicitly handle missingness, namely GRU-D [11] and XGBoost [12] models. Models are exclusively trained on the FLUCOVID dataset such that 35% of participants fall into the training set, 7.5% into the validation set, 7.5% into the held-out random retrospective test set, and 50% into the held-out prospective test set (to preserve prevalence). Features are collected on a daily basis, and predictions are made at the same frequency. A complete description of the features, and of their preprocessing, can be found in Supplementary A. For the XGBoost model, we use the past 7 days of data (inclusive of the prediction day), flattened into a matrix x ∈ R^(N_p × t·N_f), where N_p is the number of participants, t is the number of days, and N_f is the number of features. This flattening procedure is rolled across every time-point for each participant. Hyper-parameter selection was performed to minimise the validation set loss whenever a new model is trained. The GRU-D model ingests data in the format x ∈ R^(N_p × t × N_f), as it is recurrent.
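A minimal NumPy sketch of the window flattening just described: each participant-day row stacks the previous t = 7 days of features into one vector, matching the x ∈ R^(N_p × t·N_f) layout used for XGBoost, while the unflattened (N_p, t, N_f) array is the kind of input a recurrent GRU-D consumes. Zero-padding the days before enrolment is an illustrative choice, not the study's stated behaviour.

import numpy as np

def flatten_windows(x: np.ndarray, t: int = 7) -> np.ndarray:
    """Stack the previous t days (inclusive of the prediction day) into one row.

    `x` has shape (n_participants, n_days, n_features); the output has shape
    (n_participants, n_days, t * n_features), i.e. one flattened window per
    prediction day. Days before enrolment are zero-padded here for illustration.
    """
    n_p, n_d, n_f = x.shape
    padded = np.concatenate([np.zeros((n_p, t - 1, n_f)), x], axis=1)
    return np.stack(
        [padded[:, d:d + t, :].reshape(n_p, t * n_f) for d in range(n_d)],
        axis=1,
    )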
The GRU-D model used the validation set for early stopping. Because of this, different weights were learned depending on whether the model was stopped on the concurrent portion of the validation set (for randomised testing) or on the prospective portion of the validation set (for prospective testing). The comparisons between randomised and prospective evaluations for the XGBoost model use the same weights in our results, whereas the same comparisons for GRU-D do not. The participant splits between the XGBoost and GRU-D models are shared, so performance can be directly compared between the two models.

Table II. FLUCOVID: Demographics of the Large-scale Flu Surveillance study for the survey and wearable cohorts.

The survey models used a subset of features that are common to COVID-19 screening questionnaires, as seen in Table S2. Notably, several of these features are significantly different between those who tested negative for COVID-19 and those who tested positive. Even so, the relative abundances of behaviour and travel are similar: statistical significance does not promise separability during inference. These features, combined with demographic variables, are ingested into a GRU model. Due to the nature of the surveys, days where no answer is provided are assumed to be symptom free. Models are trained only where symptoms are reported, to differentiate between COVID-19 and non-COVID-19 ILI using all symptoms and behaviour patterns observed to date.

Choosing thresholds on all historical validation data does not generalise to the risks predicted in the following prospective week. Due to the class imbalance, outcome probabilities (logits) are poorly calibrated. A prospective validation set is necessary to select thresholds which will be calibrated in the next week. Through this process we found that the threshold chosen for a given sensitivity on that validation set better estimated the sensitivity we would observe on the subsequent test set (see Figure S8b). In our case this can be attributed to the inconsistency of the target distributions (prior probability shift) [40] seen in Figure 1. For randomised test sets, thresholds were drawn on a validation set that was concurrent with the training and test sets.
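A minimal sketch of that operating-point selection, assuming daily scores from the prospective validation week: pick the largest threshold whose validation sensitivity meets the target, then freeze it for the upcoming test week. The function and its tie handling are illustrative rather than the study's exact procedure.

import numpy as np

def threshold_for_sensitivity(y_true: np.ndarray, y_score: np.ndarray, target: float) -> float:
    """Choose the largest threshold whose sensitivity on the prospective
    validation week meets `target`; the threshold is then frozen for the
    upcoming test week.
    """
    pos_scores = np.sort(y_score[y_true == 1])[::-1]    # positive-day scores, descending
    if pos_scores.size == 0:
        return float("inf")
    k = max(1, int(np.ceil(target * pos_scores.size)))  # positives that must be caught
    return float(pos_scores[k - 1])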
Figure S1. Retrospective vs. prospective testing setup.

Table S1. Day-level wearable features (excerpt): each feature is listed with its type, unit, source (FBDL or FBML), and description. The features include daily resting heart rate, daily total sleep (seconds), daily total steps, awake and total wear time (minutes), heart rate summary statistics (mean, standard deviation, maximum, and 5th/25th/50th/75th/95th percentiles) computed over the whole day and over asleep, awake, and non-moving periods, and sleep timing features such as the main sleep start time.

The rolling mean and standard deviation of the individual's data are buffered by a 7-day gap. This means that on the first day of ILI symptoms, the z-score is unencumbered by the most recent 7 days of data, to avoid bias from the viral incubation period. However, illnesses lasting over a week will start to normalise the elevated resting heart rate values. A larger gap, or a longer data requirement period, could be applied at the expense of requiring a longer participation period before a prediction can be made. This calculation is shown in equation C1.

Figure S3. Heart rate feature throughout the study (mean ± standard deviation): (a) resting heart rate, raw; (b) resting heart rate, normalised z-score.

Figure S4. Regularly and irregularly sampled data trained on the same participant splits did not substantially affect performance: (a) AUROC and (b) AUPR of the XGBoost model trained with and without week-of-year, using both regular and irregular sampling.

Table S4. Sparse data at regular intervals.

Figure S5. Comparison of z-score individually normalised features to population-normalised features for XGBoost.

Otherwise, data are normalised by the mean features in the training data, as shown in equation C2, where x_{i,train} is a vector of all observations of feature i in the training set. In order for someone to have a feature observed on a certain day under the z-score normalisation, the raw feature must be observed and they must have at least 14 of the last 30 days of data as a source of standard deviation [51]. In contrast, when normalising data by the training population means, participants only need the single raw value in order for that day to be observed. We compare these two methods, referred to as the z-score and standard data preparations, in Figure S5. We do not see a large difference in performance, meaning the extra observations available in the standard data preparation regime offset any perceived gain of individual z-score normalisation. This is in spite of the inter-individual variance exceeding the intra-individual variance (see Figure S6b), as previously observed [35]. Next, data are imputed using forward-filling. At the earliest time-step for each participant, the training set population mean is applied (for the z-score preparation this corresponds to filling with 0); then the most recent measurement is carried forward in all subsequent observations. We retain the mask (whether or not the data was observed at that time-step), the time that has elapsed since the feature was last measured, and the forward-filled data [11]. Examples of regularly and irregularly sampled data are shown in Table S5 and Table S6, respectively. All pre-processing conditions (regularly vs. irregularly sampled data, imputed vs. non-imputed, z-score vs. raw data) are supported by the provided code.
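The two normalisation schemes can be sketched as below. The window length, gap, and minimum-observation count follow the description above, but the exact indexing, and whether the population scheme also scales by a training standard deviation, are assumptions, since equations C1 and C2 are not reproduced here.

import pandas as pd

def individual_zscore(x: pd.Series, window: int = 28, gap: int = 7, min_obs: int = 14) -> pd.Series:
    """Per-individual z-score (a sketch of equation C1): baseline statistics come
    from a rolling window that ends `gap` days before the current day, so the
    incubation period does not contaminate the baseline. `x` is one participant's
    daily series for a single feature; the exact indexing is an assumption.
    """
    baseline = x.shift(gap)
    mu = baseline.rolling(window, min_periods=min_obs).mean()
    sd = baseline.rolling(window, min_periods=min_obs).std()
    return (x - mu) / sd


def population_normalise(x: pd.DataFrame, train_mean: pd.Series, train_std: pd.Series) -> pd.DataFrame:
    """Population-level preparation (a sketch of equation C2): every participant-day
    is centred and scaled by training-set statistics, so only the single raw value
    is needed for a day to be usable."""
    return (x - train_mean) / train_std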
The FLUCOVID dataset relies exclusively on self-reported events, including onset and recovery dates, to form labels. Some efforts are taken to smooth events, such as merging two ILI events for a participant that bookend a healthy period of less than one week. We find that despite long delays in reporting illness, the sensitivity and specificity of the models on these labels are not substantially harmed (see Figure S7). Fortunately, the majority of labels are reported within a 1-week period, and reporting lapses decay exponentially. Nevertheless, all of our test sets include these participants in performance reporting. Using outcomes obtained directly from health systems may eliminate this perceived lapse in test performance, and it could strengthen signals during training.

Table S5. Sparse data at regular intervals, imputed using forward filling. The mask indicates if the data was observed at that time-point. The ∆t is the time since the last measurement was made. *Where data is not available at the start, it is imputed with the population mean.

Table S6. Sparse data at irregular intervals, imputed using forward filling. The mask indicates if the data was observed at that time-point. The ∆t is the time since the last measurement was made. *Where data is not available at the start, it is imputed with the population mean.

Figure S7. Performance of the ILI-detection wearable model, stratified by how far in the past an individual was remembering the event when responding to the survey: (a) wearable XGBoost model; (b) wearable GRU-D model.

2. Prospective validation set matches prospective test set logits better than randomised validation set

Appendix D: Comparison to existing models according to their evaluation schemes

We show the performance of models as found in their respective citations in Table S7. Wherever possible we match their evaluation scheme for retrospective test sets, prospective test sets, and the XGBoost and GRU-D models. Note that we do not specifically optimise for these objectives, but instead simply select the appropriate days and calculate the metrics.

Table S7. Comparison of our models to the literature with retrospective and prospective evaluations. *86 controls. **There was no significant difference between alarm rates for COVID-19 cases and potentially healthy cases, so the TPR ∼ FPR, or specificity ∼ 0.125.

Figure S8. The distribution of the validation set must match the distribution of the test set when choosing thresholds.

Table S8. Methods in existing literature (Mount Sinai [19]*, CuSum [38], Scripps [50], FitBit [43], WHOOP [35]). None of the methods have evaluation schemes that accommodate participants with missing features. *For 13 patients, only 13 symptom days are compared to all other days. It is not clear how many of the symptom experiences overlapped with COVID-19 events.
Figure S9. The wearable model outcome logits around the symptom onset date for COVID-19, confirmed influenza, and unspecified ILI. The outcomes are gathered from the XGBoost model on the retrospective, randomly selected test set.

Figure S10. The gain of the model. Green represents the best-case scenario, blue is the model-agnostic performance, and red is when the data are normalised by the model.

Figure S11. Performance of the COVID-19 wearable+survey detection models on the FLUCOVID dataset for both XGBoost (left) and GRU-D (right) wearable models, aggregated by time since onset. Cumulative predictions are for a subset of participants because of the random sampling of participants each week.

Figure S12. Varying the maximum amount of missing data allowed during testing affects the performance: (a) XGBoost model AUROC and (b) XGBoost model AUPR for a minimal amount of unobserved measurements.
References

A guide to R - the pandemic's misunderstood metric
Incubation period of 2019 novel coronavirus (2019-nCoV) infections among travellers from Wuhan, China
Estimated SARS-CoV-2 seroprevalence in the US as of
Household transmission of SARS-CoV-2: Insights from a population-based serological survey
Estimating the extent of asymptomatic COVID-19 and its potential for community transmission: Systematic review and meta-analysis
Estimating the efficacy of symptom-based screening for COVID-19
Anosmia and dysgeusia associated with SARS-CoV-2 infection: an age-matched case-control study
CDC's ILINet data
SARS-CoV-2, SARS-CoV, and MERS-CoV viral load dynamics, duration of viral shedding, and infectiousness: a systematic review and meta-analysis
Screening for COVID-19: Patient factors predicting positive PCR test
Recurrent neural networks for multivariate time series with missing values
XGBoost: A scalable tree boosting system
Meaningless comparisons lead to false optimism in medical machine learning
The cardiovascular manifestations of influenza: a systematic review
Collider bias undermines our understanding of COVID-19 disease risk and severity
Temporal dynamics in viral shedding and transmissibility of COVID-19
Best practices for analyzing large-scale health data from wearables and smartphone apps
Use of physiological data from a wearable device to identify SARS-CoV-2 infection and symptoms and predict COVID-19 diagnosis: Observational study
Parameterizing time in electronic health record studies
Viral shedding and transmission potential of asymptomatic and paucisymptomatic influenza virus infections in the community
Individual and community-level risk for COVID-19 mortality in the United States
Association of Race and Ethnicity With Comorbidities and Survival Among Patients With COVID-19 at an Urban Medical Center
Fever and cardiac rhythm
Access to lifesaving medical resources for African countries: COVID-19 testing and response, ethics, and politics
Digital contact tracing for COVID-19
Variation in false-negative rate of reverse transcriptase polymerase chain reaction-based SARS-CoV-2 tests by time since exposure
Test sensitivity is secondary to frequency and turnaround time for COVID-19 screening
The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: Estimation and application
Clinical Course and Molecular Viral Shedding Among Asymptomatic and Symptomatic Patients With SARS-CoV-2 Infection in a Community Treatment Center in the Republic of Korea
Incubation periods of acute respiratory viral infections: a systematic review
Review article: The fraction of influenza virus infections that are asymptomatic: A systematic review and meta-analysis
Digital health: Tracking physiomes and activity using wearable biosensors reveals useful health-related information
Real-time tracking of self-reported symptoms to predict potential COVID-19
Analyzing changes in respiratory rate to predict the risk of COVID-19 infection
Rethinking COVID-19 test sensitivity - a strategy for containment
Clarifying the evidence on SARS-CoV-2 antigen rapid tests in public health responses to COVID-19
Pre-symptomatic detection of COVID-19 from smartwatch data
Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship
A unifying view on dataset shift in classification
Consistent estimators for learning to defer to an expert
Racial Disparities in Incidence and Outcomes Among Patients With COVID-19
Assessment of physiological signs associated with COVID-19 measured using wearable devices
Assessment of SARS-CoV-2 Screening Strategies to Permit the Safe Reopening of College Campuses in the United States
Large-scale assessment of a smartwatch to identify atrial fibrillation
Reporting accuracy of rare event classifiers
Indicators of retention in remote digital health studies: a cross-study evaluation of 100,000 participants
The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study
Hospitalization and mortality among Black patients and White patients with COVID-19
Wearable sensor data and self-reported symptoms for COVID-19 detection
Harnessing wearable device data to improve state-level real-time surveillance of influenza-like illness in the USA: a population-based study
Learning to reweight examples for robust deep learning
Characterizing COVID-19 and influenza illnesses in the real world via person-generated health data
Leveraging clinical time-series data for prediction: A cautionary tale
Feasibility of continuous fever monitoring using wearable devices
Antibodies, Immunity, and COVID-19
False-positive COVID-19 results: hidden problems and costs
Ethnicity and clinical outcomes in COVID-19: A systematic review and meta-analysis
Interpreting a COVID-19 test result
Virological assessment of hospitalized patients with COVID-2019
False negative tests for SARS-CoV-2 infection - challenges and implications
Electronic wearable device and physical activity among US adults: An analysis of 2019 HINTS data
Association of chemosensory dysfunction and COVID-19 in patients presenting with influenza-like symptoms
Developing personalized models of blood pressure estimation from wearable sensors data using minimally-trained

[Figure panels: (a) XGBoost model tested on 50% of the FLUCOVID dataset; (b) GRU-D model tested on 50% of the FLUCOVID dataset.]

Funded by the NIH, the National Cancer Institute (NCI), and the National Institute of Biomedical Imaging and Bioengineering (NIBIB), award no. 75N91020C00034.

The day-level features available to the wearable model are listed in Table S1. We used t-SNE and PCA to investigate the features' dependence on time; the results are shown in Figure S2. The survey features available to our survey model for participants in the FLUCOVID dataset are shown in Table S2.

We compared several data preprocessing steps: irregular vs. regular sampling [20], population-level vs. individual-level normalisation [51], and forward-filled imputation vs. no imputation. When a day has no observed data, the data are either re-indexed to obtain a regular interval (Table S4) or left alone to retain an irregular interval [20] (Table S3). Examples of irregular and regular sampling are shown in Tables S3 and S4, respectively. We found that the sampling interval did not substantially affect retrospective model performance on randomised training, validation, and test splits (see Figure S4). A minimal sketch contrasting the two sampling representations is given below.
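The following is a minimal pandas sketch, under assumed column and index conventions, of the two sampling representations and of the mask, ∆t, and forward-fill channels illustrated in Tables S3-S6. It is an approximation rather than the study's pipeline; `population_mean` is a hypothetical per-channel Series of training-set means.

```python
import numpy as np
import pandas as pd

def to_model_inputs(df, regular=True, population_mean=None):
    """`df`: one participant's observations with a DatetimeIndex and one
    column per sensor channel; NaN marks an unobserved value."""
    if regular:
        # Re-index to a regular daily grid; days with no data become all-NaN rows.
        full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
        df = df.reindex(full_range)
    # Otherwise keep only the observed (irregular) time-points as they are.

    mask = df.notna().astype(int)                    # 1 where a value was observed
    # Days since the last observation of each channel (0 on observed days);
    # delta_t stays NaN before a channel's first observation.
    days = np.arange(len(df))[:, None]
    last_obs = pd.DataFrame(np.where(mask, days, np.nan),
                            index=df.index, columns=df.columns).ffill()
    delta_t = pd.DataFrame(days - last_obs.to_numpy(),
                           index=df.index, columns=df.columns)

    values = df.ffill()                              # forward-fill observed values
    if population_mean is not None:
        values = values.fillna(population_mean)      # leading gaps: population mean
    return values, mask, delta_t
```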
Next, the data normalisation is considered. For the z-score normalisation, the rolling average and standard deviation are calculated for each participant and each feature across 28 time-points of that participant's data. For the regularly sampled data this corresponds to 28 calendar days (some of which will be missing), whereas for the irregularly sampled data it means the previous 28 days on which any measurement occurred (all of which will have been observed). We use the mean and standard deviation from this 28-day period to calculate the individual's z-score, and only where at least 14 observations are made, similar to the procedure followed in previous flu monitoring pipelines [51]. Our intention is to avoid making predictions based on an individual's deviation from the population average [13], and instead to make predictions based on an individual's own internal consistency (a code sketch of this rolling normalisation is given below).

We observed an effect of missing data, due to lack of device wear time, on the performance of the models: if an individual has not worn their device at all over a given week, we see a drop in performance. This motivates work to improve the frequency with which individuals in the study wear their devices.
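A minimal sketch of the rolling individual-level z-score described above, assuming a 28-time-point window and at least 14 observed values; whether the current day is excluded from its own baseline is an assumption of this sketch, not something specified in the text.

```python
import pandas as pd

def rolling_zscore(df, window=28, min_obs=14):
    """Per-participant, per-feature z-score against the participant's own
    trailing history. `df` holds one participant's time-ordered features
    (rows = time-points, columns = features); NaN marks missing values."""
    # Shift by one so the current time-point is not part of its own baseline
    # (an assumption of this sketch).
    history = df.shift(1)
    mean = history.rolling(window, min_periods=min_obs).mean()
    std = history.rolling(window, min_periods=min_obs).std()
    std = std.where(std > 0)            # avoid division by zero for constant windows
    # NaN wherever fewer than `min_obs` prior observations are available.
    return (df - mean) / std
```

Applied per participant (for example via a groupby over participant IDs), this approximates the individual-level normalisation that we compare against simple population-level normalisation.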