key: cord-0793748-2leg4980 authors: Sudre, C. H.; Lee, K.; Ni Lochlainn, M.; Varsavsky, T.; Murray, B.; Graham, M. S.; Menni, C.; Modat, M.; Bowyer, R. C. E.; Nguyen, L. H.; Drew, D. A.; Joshi, A. D.; Ma, W.; Guo, C. G.; Lo, C. H.; Ganesh, S.; Buwe, A.; Capdevila Pujol, J.; Lavigne du Cadet, J.; Visconti, A.; Freydin, M.; El Sayed Moustafa, J. S.; Falchi, M.; Davies, R.; Gomez, M. F.; Fall, T.; Cardoso, M. J.; Wolf, J.; Franks, P. W.; Chan, A. T.; Spector, T. D.; Steves, C. J.; Ourselin, S. title: Symptom clusters in Covid19: A potential clinical prediction tool from the COVID Symptom study app date: 2020-06-16 journal: nan DOI: 10.1101/2020.06.12.20129056 sha: 7557d933870b23b5c5647f40975ce07a75ab3515 doc_id: 793748 cord_uid: 2leg4980 As no one symptom can predict disease severity or the need for dedicated medical support in COVID-19, we asked if documenting symptom time series over the first few days informs outcome. Unsupervised time series clustering over symptom presentation was performed on data collected from a training dataset of completed cases enlisted early from the COVID Symptom Study Smartphone application, yielding six distinct symptom presentations. Clustering was validated on an independent replication dataset collected between May 1 and May 28, 2020. Using the first 5 days of symptom logging, the ROC-AUC of need for respiratory support was 78.8%, substantially outperforming personal characteristics alone (ROC-AUC 69.5%). Such an approach could be used to monitor at-risk patients and predict medical resource requirements days before they are required.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license. This version posted June 16, 2020. medRxiv preprint doi: https://doi.org/10.1101/2020.06.12.20129056

However, heterogeneity in disease and presentation is evident, and the ability to predict required medical support ahead of time is limited. In this work, we sought to develop a clinical tool based on the time series of early development of COVID-19 that could be predictive of the need for high-level care in individuals more likely to seek medical help. The COVID Symptom Study is a unique prospective population-based study collecting daily reports of symptoms from millions of users. The smartphone app offers a guided interface to report a range of baseline demographic information and comorbidities (as previously reported (3, 4)) and was developed by Zoe Global Limited with input from clinicians and scientists from King's College London and Massachusetts General Hospital.
With continued use, participants provide daily updates on symptoms, information on health care visits, COVID-19 testing results, and whether they are seeking medical support, including the level of intervention and related outcomes. Case reports have highlighted that COVID-19 infected individuals may present with different symptoms (5-7). We hypothesized that longitudinal symptoms reported during the illness would cluster into distinct subtypes with differing clinical needs, and that we could use this information to create a predictive tool for medical support that could be used for resource planning and improvement of COVID-19 patient monitoring. In order to study the time series of symptom occurrence for the most severe cases, for which respiratory support may be needed, clusters of longitudinally reported symptoms were obtained from an unsupervised clustering analysis (8) (see Methods). For our training dataset, we used data obtained from 1,653 users of the app with persistent symptoms and regular logging, from disease onset until hospitalization or beginning of recovery, for which the data inclusion cut-off was 30th April 2020. An independent replication set was created using separate individuals fitting the criteria with a disease peak from 30th April to 28th May 2020. Patient selection is detailed online with an associated flow diagram (Sup Fig S1). The training sample for this analysis comprised 1,653 participants, of whom 383 reported at least one hospital visit and 107 reported respiratory support (defined as ventilation or supplementary oxygen). The independent replication sample consisted of 1,047 participants, of whom 207 reported a visit to hospital and 59 received respiratory support. Of participants in the independent replication set, 87.8% were from the United Kingdom, 7.5% were from the United States and 4.7% were from Sweden.
Prediction of the final cluster into which a participant would fall based on a short reporting period was assessed through tabulation of confusion matrices and weighted precision and recall. A predictive system focused on the need for respiratory support (supplemental oxygen or ventilation) was then built featuring the inferred cluster, the aggregated sum of symptoms and features of individual characteristics using five days of symptom reporting. Both clustering and predictive models were applied to the independent replication set of 1,047 individuals.

Over the whole set of 2,700 selected subjects, a number of demographic and health parameters were associated with higher risk for respiratory support requirement, as quantified by odds ratios (OR) and 95% confidence intervals.

Figure caption (bottom panel): associated Z-score of presentation of symptoms over the overall symptom distribution (red = reported more than average; blue = reported less than average). The clusters are ordered from left to right by rates of reported hospital visit, with associated rates of respiratory support of 1.5%, 4.4%, 3.7%, 8.6%, 9.9% and 19.8% respectively.

Compared to Clusters 4-6, of which 8.6%-19.8% required respiratory support, Clusters 1 and 2 represent milder forms of COVID-19, with 1.5% and 4.4% respectively requiring respiratory support.
These clusters showed predominantly upper respiratory tract symptoms and were distinguished from each other by the absence of muscle pain in Cluster 2 compared to Cluster 1, and slightly increased reports of skipped meals and fever in Cluster 2. Cluster 1 had notably lower mean age and BMI than the clusters containing patients with higher likelihood of requiring respiratory support.

Cluster 3 shows stronger gastrointestinal symptoms in isolation (diarrhea, skipped meals) and a relatively reduced need for respiratory support, of 3.7%. However, the associated rate of hospital visit was high compared to Clusters 1 and 2. Clusters 4, 5 and 6 included participants reporting more severe COVID-19, with 8.6%, 9.9% and 19.8% of individuals within these clusters requiring respiratory support, respectively. These three clusters represent distinct presentations, with Cluster 4 marked by the early presence of severe fatigue and the continuous presence of chest pain and persistent cough. In turn, individuals in Cluster 5 reported confusion, skipped meals and severe fatigue. Finally, individuals in Cluster 6 reported more marked symptoms of respiratory distress, including early onset of shortness of breath accompanied by chest pain. These respiratory symptoms were combined with significant abdominal pain, diarrhea and confusion when compared with other clusters. The proportion of frail people was higher in Clusters 5 and 6 than in what we consider to be the milder clusters.
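The cluster descriptions above rest on how much more or less often a symptom is reported in one cluster than across all participants, which the figure caption expresses as a Z-score. The following is a minimal sketch of one plausible reading of that quantity, not the authors' implementation: for each symptom, the cluster's mean reporting rate minus the overall mean, divided by the overall standard deviation. The function name `symptom_z_scores`, the data layout and the toy values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def symptom_z_scores(reports, clusters):
    """Z-score of each symptom's reporting rate, per cluster.

    reports  : list of dicts mapping symptom name -> 0/1 (hypothetical layout),
               one dict per participant.
    clusters : list of cluster labels aligned with `reports`.
    Returns a dict keyed by (cluster, symptom).
    """
    z = {}
    for symptom in sorted(reports[0]):
        values = [r[symptom] for r in reports]
        mu, sigma = mean(values), pstdev(values)  # overall distribution
        for c in sorted(set(clusters)):
            in_cluster = [r[symptom] for r, k in zip(reports, clusters) if k == c]
            # Positive: reported more than average; negative: less than average.
            z[(c, symptom)] = (mean(in_cluster) - mu) / sigma if sigma > 0 else 0.0
    return z


# Made-up toy data: four participants, two symptoms, two clusters.
toy_reports = [
    {"fatigue": 1, "cough": 1},
    {"fatigue": 1, "cough": 0},
    {"fatigue": 0, "cough": 1},
    {"fatigue": 0, "cough": 1},
]
toy_clusters = [6, 6, 1, 1]
for key, value in sorted(symptom_z_scores(toy_reports, toy_clusters).items()):
    print(key, round(value, 3))
```

In this toy example, fatigue is reported by everyone in cluster 6 and no one in cluster 1, so it receives Z-scores of +1 and -1 respectively, mirroring the red/blue coding of the figure.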
The ability to predict into which cluster a participant with COVID-19 will fall early in the disease process may enable the provision of adequate respiratory monitoring with pulse oximetry to at-risk patients. We used a confusion matrix analysis (as seen in Figure 2) and considered between two and nine days of recorded symptom data to perform the projection to different clusters. We found that after 5 days of reporting, despite 84.8% of the included samples presenting longer time series in the training set, the error in projection was modest both in the training and the independent replication set. In this 6-class problem, the precision rose from 48.0% [...] (Figure S4). At five days, it appeared that headache was the symptom most consistently reported across all clusters (see Figure 4), while severe fatigue appeared in those clusters with increased risk of requiring medical support. The duration of confusion was longer in more severe clusters, while loss of smell or taste was reported over a longer duration in milder clusters.

While the clusters are informative in their own right, we sought to develop a clinically useful tool using them as a feature in a machine-learning-based system for predicting the need for respiratory support in COVID-19.
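The confusion-matrix evaluation of early cluster assignment described above can be illustrated with a short sketch. It assumes that "weighted" precision and recall mean per-class values averaged with weights proportional to the number of true members of each class, which is the common convention but is not confirmed by the text. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def confusion_matrix(true_labels, pred_labels, classes):
    """Rows are final (true) clusters, columns are early (predicted) clusters."""
    counts = Counter(zip(true_labels, pred_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def weighted_precision_recall(matrix):
    """Per-class precision and recall, averaged with class-support weights."""
    total = sum(sum(row) for row in matrix)
    precision = recall = 0.0
    for k in range(len(matrix)):
        support = sum(matrix[k])                   # participants truly in class k
        predicted = sum(row[k] for row in matrix)  # participants assigned to class k
        true_pos = matrix[k][k]
        weight = support / total
        precision += weight * (true_pos / predicted if predicted else 0.0)
        recall += weight * (true_pos / support if support else 0.0)
    return precision, recall


# Made-up example with three clusters instead of six, for brevity.
final_cluster = [1, 1, 1, 2, 2, 3]
early_cluster = [1, 1, 2, 2, 2, 3]
m = confusion_matrix(final_cluster, early_cluster, classes=[1, 2, 3])
p, r = weighted_precision_recall(m)
print(m)                         # [[2, 1, 0], [0, 2, 0], [0, 0, 1]]
print(round(p, 3), round(r, 3))  # 0.889 0.833
```

With support weighting, the weighted recall equals overall accuracy (5 of 6 correct in the toy example), so the weighted precision is the figure that adds information about how assignments are distributed across clusters.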
Five days of reporting produced stable symptom clusters, allowing for the construction of a predictive system that utilized data collected in the initial five days. The model used the predicted cluster (given 5 days of reporting), the aggregated sum of symptoms up to and including that day, and personal characteristics including BMI, age, frailty (PRISMA7 score) and presence of comorbidities. The best model, trained with 5-fold cross-validation and grid search hyperparameter [...] larger false positive rate (46.5 [43.8; 49.0]), providing a clear argument for the inclusion of symptomatology alongside personal characteristics in prediction models for more severe forms of COVID-19.

Our study was limited by the use of self-reported information collected from individuals who used smartphone devices. Additionally, where individuals become too unwell to record their symptoms on the app later in the disease course, time series used in this work may not have accounted fully for the peak of the disease. To address this limitation, reporting-by-proxy was included on the app in late April 2020. National and regional differences in guidelines for hospital admissions and utilization of respiratory support exist and, given the multi-national nature of this study, must be acknowledged. In addition, our model cannot account for silent presentations such as cases of silent hypoxia reported in the literature (11).
It must also be noted that, due to the prospective design of the study and changes in population characteristics using the app, the independent replication set was observed to be older than the training set, which may lead to slight over-estimation of severity in some younger individuals.

The ability to predict medical resource requirements days before they arise has significant clinical utility in this pandemic. If widely utilized, healthcare providers and managers could track large groups of patients and predict numbers requiring hospital care and respiratory support days ahead of these needs arising, allowing for staff, bed and intensive care planning. As a clinical tool, this approach could be implemented at local level, allowing patients to be monitored remotely by their primary healthcare teams with alert systems triggered when individuals demonstrate symptomatology associated with a high-risk cluster. Higher risk individuals could be targeted for increased care to ensure that they do not struggle to access advice when becoming more unwell. For instance, patients who fall into Cluster 5 or 6 at day 5 of the illness have a significant risk of hospitalization and respiratory support and may benefit from home pulse oximetry with daily phone calls from their general practice to ensure that hospital attendance occurs at the appropriate point in the course of their illness. Those in Clusters 3 and 4 may also be at high risk and benefit from proactive care, for example with glucose and electrolyte monitoring. A trigger system could be built in, as suggested in other initiatives (12), alerting these patients at high risk to seek medical attention at a point of specified predicted risk. Additionally, some patients and practitioners may be empowered by a clinical tool into which they could input longitudinal symptomatology and personal characteristics and receive personalised information on risk stratification.
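The headline comparison in this work (ROC-AUC of 78.8% with five days of symptoms versus 69.5% with personal characteristics alone) relies on the ROC-AUC, which can be read as the probability that a randomly chosen participant who needed respiratory support receives a higher predicted risk than a randomly chosen participant who did not. The sketch below computes that quantity directly from its pairwise definition. The function name `roc_auc`, the labels and the risk scores are made up for illustration and are not study data.

```python
def roc_auc(labels, scores):
    """ROC-AUC from its pairwise definition (ties count as half a win).

    labels : 1 if the participant needed respiratory support, else 0.
    scores : predicted risk from any model (higher means higher risk).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up risk scores for four hypothetical participants.
needed_support = [0, 0, 1, 1]
predicted_risk = [0.10, 0.40, 0.35, 0.80]
print(roc_auc(needed_support, predicted_risk))  # 0.75
```

The pairwise form is quadratic in the number of participants, which is acceptable for samples of a few thousand; rank-based formulations give the same value more efficiently on larger cohorts.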
Fair Allocation of Scarce Medical Resources in the Time of Covid-19.
Intensive care management of coronavirus disease 2019 (COVID-19): challenges and recommendations.
Rapid implementation of mobile technology for real-time epidemiology of COVID-19. Science.
Real-time tracking of self-reported symptoms to predict potential COVID-19.
COVID-19 and Gastrointestinal Symptoms-A Case Report.
Atypical presentation of COVID-19 in a frail older person.
Multivariate time series clustering based on common principal component analysis.
PRISMA-7: A case-finding tool to identify older adults with moderate to severe disabilities.
Index for rating diagnostic tests.
Silent hypoxia: A harbinger of clinical deterioration in patients with COVID-19.
Rapid Implementation of a COVID-19 Remote Patient Monitoring Program.

Data and materials availability: Data used in this study are available to bona fide researchers through UK Health Data Research using the following link: https://healthdatagateway.org/detail/9b604483-9cdc-41b2-b82c-14ee3dd705f6