key: cord-0606217-6yga42mq authors: Fyles, Martyn; Vihta, Karina-Doris; Sudre, Carole H; Long, Harry; Das, Rajenki; Jay, Caroline; Wingfield, Tom; Cumming, Fergus; Green, William; Hadjipantelis, Pantelis; Kirk, Joni; Steves, Claire J; Ourselin, Sebastien; Medley, Graham; Fearon, Elizabeth; House, Thomas title: Diversity of symptom phenotypes in SARS-CoV-2 community infections observed in multiple large datasets date: 2021-11-10 journal: nan DOI: nan sha: 6c9c7120ae286b3e7e0b7561a667f455fff0414b doc_id: 606217 cord_uid: 6yga42mq Understanding variability in clinical symptoms of SARS-CoV-2 community infections is key in management of the ongoing COVID-19 pandemic. Here we bring together four large and diverse datasets deriving from routine testing, a population-representative household survey and participatory mobile surveillance in the United Kingdom and use cutting-edge unsupervised classification techniques from statistics and machine learning to characterise symptom phenotypes among symptomatic SARS-CoV-2 PCR-positive community cases. We explore commonalities across datasets and by age bands. While we observe separation due to the total number of symptoms experienced by cases, we also see a separation of symptoms into gastrointestinal, respiratory and other types, and different symptom co-occurrence patterns at the extremes of age. This is expected to have implications for identification and management of community SARS-CoV-2 cases. Since the identification of the SARS-CoV-2 virus, the COVID-19 pandemic has led to over 236 million confirmed cases and 4.8 million confirmed deaths. In response to this, one of the largest ever public health responses has been mounted, with over 6.2 billion vaccine doses administered and a large variety of non-pharmaceutical interventions that have fundamentally changed behaviour and healthcare provision around the world since the start of 2020 (World Health Organization, 2021; Hale et al., 2021; Google LLC, 2021) . 
Understanding the clinical presentation and course of SARS-CoV-2 infection remains central for transmission control, particularly in determining policy for identification of cases for isolation and tracing of their contacts (Fyles et al., 2021; Crozier et al., 2021) , and potentially to predict clinical outcomes. COVID-19 cases can present with symptoms from a wide range of categories: respiratory, systemic, cardiovascular and gastrointestinal (Struyf et al.) , with high variability between individuals depending on factors such as age and comorbidities (Williamson et al., 2020; Clift et al., 2020) . In addition, a significant proportion of infections are estimated to remain asymptomatic (Buitrago-Garcia et al.) . Assessment of diversity of genotypes and (endo-)phenotypes is a long-standing tool in both infectious diseases and chronic non-communicable diseases, which has been significantly accelerated by modern experimental and theoretical techniques (Hofmann and Zeuzem, 2011; Deliu et al., 2017; Geifman et al., 2018) . In particular, such analysis often helps with the standard process of identifying multiple disease aetiologies with the same presentation, or vice versa a single disease with highly variable outcomes. This latter distinction is particularly important for COVID-19, where different courses of action, including public health interventions, are taken depending on symptom status (NHS, 2021) . Here, we report patterns of symptom occurrence, co-occurrence and clustering in PCR-positive symptomatic SARS-CoV-2 cases -previously considered predominantly in hospitalisation data heavily skewed towards more severe infections (Swann et al., 2020; Millar et al.; Sudre et al., 2021) -in four very large community-based datasets for the time period May 2020 to March 2021 in the UK. Due to this data collection time period, and the effects of vaccination on preventing disease, we expect the datasets contain predominantly unvaccinated individuals. 
These datasets are diverse in their sampling and data collection methods and include: (a) 1,637,965 symptomatic cases from 'Pillar 2' testing data from the National Health Service (NHS) Test and Trace system, designed to capture cases in the general population; (b) 112,925 symptomatic cases from the Second Generation Surveillance System (SGSS) from England's national laboratory reporting system, which includes cases associated with healthcare settings among patients and healthcare staff; (c) 52,084 symptomatic self-reported cases from the COVID-19 Symptom Study (CSS), which uses a smartphone app associated with https://covid.joinzoe.com/ to collect daily symptom reports; and (d) 9,166 symptomatic cases from the Office for National Statistics COVID-19 Infection Survey (CIS), a longitudinal study of a representative sample of UK households. From each dataset, we extract all PCR-positive individuals, and associate them with symptoms experienced within a time window of the test appropriate for the dataset. More detail about each dataset, data collection and extraction is given in the Supplementary Materials. For the $i$-th individual and $a$-th symptom, we let $X_{ia} = 1$ if the symptom is present during the time window around the positive test and $X_{ia} = 0$ otherwise. For a dataset with $n$ individuals measuring $p$ symptoms, we can then construct an $n \times p$ matrix $[X_{ia}]$, where the rows of this matrix form a set of $n$ length-$p$ feature vectors for individuals, $\{y_i\}$, and the columns form a set of $p$ length-$n$ feature vectors for symptoms, $\{x_a\}$, each of which can then be used as input for unsupervised learning algorithms. In addition to descriptive analysis of the data, we used three complementary approaches to examine clustering and co-occurrence of symptoms. 
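As an illustration of the binary matrix described above, the following minimal sketch builds $[X_{ia}]$ from per-case symptom sets. The symptom list, the three example cases and the function name `build_symptom_matrix` are hypothetical and are not taken from the study data or code.

```python
import numpy as np

# Hypothetical symptom list and case reports (illustrative only, not study data).
symptoms = ["cough", "fever", "fatigue", "diarrhoea"]
reports = [
    {"cough", "fever"},
    {"fatigue"},
    {"cough", "fatigue", "diarrhoea"},
]


def build_symptom_matrix(reports, symptoms):
    """Return the n x p binary matrix X with X[i, a] = 1 if case i reported symptom a."""
    column = {name: a for a, name in enumerate(symptoms)}
    X = np.zeros((len(reports), len(symptoms)), dtype=np.int8)
    for i, reported in enumerate(reports):
        for name in reported:
            X[i, column[name]] = 1
    return X


X = build_symptom_matrix(reports, symptoms)
individual_vectors = X   # row i is y_i, a length-p vector for one individual
symptom_vectors = X.T    # row a of X.T is x_a, a length-n vector for one symptom
```

Rows of `X` would feed individual-level methods, and rows of `X.T` would feed symptom-level methods.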
We first performed hierarchical clustering using complete linkage (Hastie et al., 2009) and the Jaccard distance between symptom vectors, $D_{\mathrm{Jac}}(x_a, x_b) = |x_a \cap x_b| / |x_a \cup x_b|$, as appropriate for binary data, with results shown in Fig. 1. This figure shows the matrix of such distances as a heatmap, with a dendrogram to its right. We read these dendrograms from right to left, with splitting points representing points at which the algorithm suggests a separation of symptoms into groups on the basis of their occurrence in infected individuals. Of the plots, the CIS data in panel D shows the clearest signal of separation of symptoms under this analysis method: gastrointestinal symptoms form a separate symptom grouping, joining the rest of the hierarchy only at the highest level; the distinctive loss of taste and smell joins the tree at the next; and the remaining symptoms join individually at remaining levels. In Pillar 2 and SGSS data (Fig. 1 panels A and B) a similar pattern is observed, except for additional complexity associated with uncommon symptoms reported by ≤ 5% of positives and, for Pillar 2, loss of smell or taste joining at a similar point on the tree to upper respiratory tract symptoms. For the CSS data in panel C, we see that shortness of breath and hoarse voice, symptoms not collected in other studies, appear before gastrointestinal symptoms join the tree. Secondly, we performed Logistic Principal Component Analysis (LPCA), an extension of Principal Component Analysis to binary data (Landgraf and Lee, 2020). This method is used to project the set of individual feature vectors $\{y_i\}$ for each dataset onto components (two, in our case) that are sequentially as close to the original set of vectors as possible. The results of this analysis are shown in Fig. 2, and show strikingly consistent patterns across datasets, despite the various biases and data collection techniques employed. 
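The hierarchical clustering step described above can be sketched with SciPy as follows. This is a minimal illustration on a made-up binary matrix, not the authors' implementation. Note that SciPy's `"jaccard"` metric returns the dissimilarity $1 - |x_a \cap x_b| / |x_a \cup x_b|$, so that small values correspond to symptoms that frequently co-occur.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Made-up data: 6 cases (rows) by 4 symptoms (columns); illustrative only.
symptoms = ["cough", "sore throat", "diarrhoea", "nausea"]
X = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
        [0, 1, 0, 0],
    ],
    dtype=bool,
)

# pdist compares rows, so transpose to compare symptom vectors (columns of X).
condensed = pdist(X.T, metric="jaccard")
distances = squareform(condensed)  # p x p matrix, as shown in a heatmap

# Complete linkage: clusters are merged on the maximum pairwise distance.
Z = linkage(condensed, method="complete")

# Cut the tree into two groups of symptoms.
labels = fcluster(Z, t=2, criterion="maxclust")
for name, label in zip(symptoms, labels):
    print(f"{name}: cluster {label}")
```

The linkage matrix `Z` could be passed to `scipy.cluster.hierarchy.dendrogram` to draw a tree of the kind described for Fig. 1.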
The first strong signal in the data is that the first principal component involves all symptoms in the same direction, meaning that the closest one-dimensional description of community symptoms is the number of symptoms experienced. The second principal component, with some exceptions that vary by dataset, suggests that a source of variation is negative correlation between upper respiratory tract symptoms and systemic (Pillar 2, SGSS and CIS) and gastrointestinal symptoms (SGSS, CIS and CSS). The overall interpretation of these results is that a parsimonious description of COVID-19 symptoms at the individual level can be provided by quantifying the total number of symptoms experienced, followed by the relative contribution of upper respiratory symptoms versus systemic or gastrointestinal symptoms to the total number of symptoms experienced. The contribution of upper respiratory versus systemic and gastrointestinal symptoms is also seen, and in fact strengthened, when examining the age-stratified data (children 0-17 years, adults 18-54 years and elder adults aged 55 years and older, see Supplementary Materials). That different symptoms are identified by taking a symptom-level view of clustering, as in the hierarchical analysis, and an individual-level view of co-occurrence, as in LPCA, is explained by the different questions these methods address. LPCA attempts to find a description of overall variation of the symptoms of individuals within the dataset, while hierarchical clustering groups by suitably defined co-occurrence to find natural clusters of symptoms within the dataset. Our third main analysis method aims to provide an overall picture by considering low-dimensional embeddings of the data based on the structure of interactions encoded in the datasets. In particular, Uniform Manifold Approximation and Projection (UMAP) and associated algorithms (McInnes and Healy, 2018; McInnes et al.) produce a low-dimensional embedding using local structure of the data (i.e. groups of commonly co-occurring symptoms) and, provided the intrinsic dimension of the system is not too large, can capture some of the global structure of the data (i.e. the relationships between such groups of data points). 
The result is that symptoms which commonly co-occur are placed close to each other in the resulting low-dimensional embeddings. Hyperparameters are important for UMAP, so we performed the analysis for two different hyperparameter choices: one that focuses more on the local structure (shown in Fig. S13); and one that focuses less on the local structure and attempts to preserve more of the global structure of the data (shown in Fig. S12). To compare findings across datasets more explicitly, we extend the UMAP analyses above by using the AlignedUMAP algorithm (McInnes et al.). AlignedUMAP takes several different datasets as inputs and finds the optimal embedding for each input dataset, subject to the loose constraint that data points that are shared between datasets are placed in similar positions in the low-dimensional embeddings. These are produced through a trade-off between finding the optimal embedding for individual datasets, and aligning the embedding of shared symptoms across datasets. By aligning embeddings we gain several useful insights, most importantly that an embedding can be directly compared with the others it was aligned against, allowing better assessment of similarities and differences. We produce embeddings of each dataset that are aligned based upon the core symptoms shared by all the datasets in our analysis: cough, diarrhoea, fatigue, fever, headache, muscle ache, and sore throat. These, shown in Fig. 3, allow us to explore whether datasets share a common underlying structure of symptom co-occurrence. Inspection of the embeddings with alignment based upon the core symptoms shared by different datasets provides some evidence of a broad structure shared across all datasets. 
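To make the alignment constraint concrete, the sketch below builds the index mappings that link each shared core symptom in one dataset to the same symptom in the next. The per-dataset symptom orderings and the helper `build_relations` are hypothetical. The closing comment assumes the `umap-learn` package, whose `AlignedUMAP` class accepts a list of such dictionaries through its `relations` argument; the metric and hyperparameters used in the study are not specified here.

```python
# Core symptoms shared by all datasets, as listed in the text.
core = {
    "cough", "diarrhoea", "fatigue", "fever",
    "headache", "muscle ache", "sore throat",
}

# Hypothetical symptom orderings for three datasets (illustrative only).
symptom_lists = [
    ["cough", "fever", "fatigue", "headache", "muscle ache",
     "sore throat", "diarrhoea", "rhinitis"],
    ["fever", "cough", "sore throat", "headache", "fatigue",
     "diarrhoea", "muscle ache", "nausea"],
    ["hoarse voice", "cough", "diarrhoea", "fatigue", "fever",
     "headache", "muscle ache", "sore throat"],
]


def build_relations(symptom_lists, shared):
    """Map row indices of shared symptoms in each dataset to the next dataset."""
    relations = []
    for left, right in zip(symptom_lists[:-1], symptom_lists[1:]):
        right_index = {name: j for j, name in enumerate(right)}
        relations.append(
            {
                i: right_index[name]
                for i, name in enumerate(left)
                if name in shared and name in right_index
            }
        )
    return relations


relations = build_relations(symptom_lists, core)

# With umap-learn installed, and one symptom-by-case array per dataset in
# `symptom_arrays` (hypothetical name), the aligned fit would look like:
#   import umap
#   mapper = umap.AlignedUMAP().fit(symptom_arrays, relations=relations)
#   embeddings = mapper.embeddings_
```

Symptoms outside the core set, such as the hypothetical `rhinitis` entry, get no mapping and are therefore free to be placed wherever suits their own dataset.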
The embeddings produced can be broadly described by a central cluster of systemic symptoms, and cough. Lower respiratory tract symptoms are typically placed nearby, in particular with shortness of breath often being placed close to fatigue. The upper respiratory tract symptoms (sore throat, rhinitis, sneezing) are typically placed further away from gastrointestinal symptoms, with the exception of lost/altered smell or taste symptoms. On most plots, the gastrointestinal symptoms exist as a tail or are slightly separated from the main central group of systemic symptoms. This complements the LPCA analysis, which suggested that individuals separate into those who experience upper respiratory tract symptoms and those who experience a mixture of systemic and gastrointestinal symptoms. As we did with hierarchical clustering and LPCA, we stratified each dataset based upon age bands that represent children and adolescents, adults and elders, and produced aligned embeddings for ease of comparison (see Supplementary Materials). However, AlignedUMAP allows us to directly compare more embeddings than is possible for dendrograms or symptom loadings, as there exist explicit relationships between the embeddings. We perform an additional analysis where we again age-stratify each dataset, this time into 10-year strata, and produce aligned embeddings. These embeddings can then be visualised in a 3-dimensional space to describe how patterns of symptom co-occurrence change as age increases (see Fig. 4), where linear interpolation has been used to connect the embeddings from adjacent ten-year age strata. Across all datasets, we observe changes to the local structure, indicated by the splitting of the rope/ribbon-like structures for the youngest age strata (under 10 years old), and for the older age strata (around 70 years old). 
The changes indicate that, despite the attempt to align symptoms in adjacent embeddings, the symptom-co-occurrence patterns of the data have changed too substantially for that to be achieved. This is clearest in the CIS dataset, where some gastrointestinal symptoms (diarrhoea, nausea/vomiting, abdominal pain) are separated out from the main body of symptoms for the youngest and older age-strata. In Pillar 2 and SGSS, we find the formation of new clusters of symptoms, in the older age strata with a first cluster containing vomiting and nausea, and a second cluster containing headache, sore throat, muscle ache and joint pain. For the CSS dataset, separation into two main symptom clusters is observed, with one cluster containing: abdominal pain, muscle ache, headache, sore throat, chest pain, and cough, and with the second cluster containing loss of appetite, altered/loss of smell, diarrhoea, hoarse voice, with slightly separated shortness of breath, fever, delirium and fatigue. For the under-10s, the produced embeddings typically consist of small clusters of symptoms. The CIS dataset is the exception however, again separating out gastrointestinal symptoms from the main body of symptoms. Inspection of the Jaccard distance matrices for the youngest age strata suggests that a possible explanation may be that fewer total symptoms are reported for young children. The observed clusters in the embeddings appear to consist mainly of pairs, or triplets of symptoms that do commonly co-occur, e.g. rhinitis and sneezing. However, the level of co-occurrence between these distinct small clusters is very small leading to separation in the low dimensional embeddings. In summary, we have shown that considerable complexity and variation exists in COVID-19 symptoms in community infections. 
We find that the primary source of variation is in the number of symptoms experienced by a case, but conditional on this there are various ways to be ill that provide a more fine-grained description of phenotypes. In particular, we find evidence for separation between upper respiratory and systemic symptoms, both including commonly reported symptoms, and between upper respiratory and gastrointestinal symptoms, though the latter are less common overall. While the deep structure of symptom clustering was similar across the middle range of age groups, we found some evidence that patterns of symptom reporting changed among the youngest and oldest -though further work may be required to understand whether this is due to symptom reporting differences, or differences in the symptoms experienced. While there are some differences in our findings across the four datasets, this is unsurprising given their very different case sampling designs, data collection methods, symptom reporting windows and specific symptom data collected. Routinely tested cases, for instance, will be selected based on the symptoms that qualify cases for testing (Pillar 2) leading to lower expected variations in the presence of these symptoms compared to cases identified via random sampling. Indeed, the broad consistency of findings across these datasets, which derive from routine, representative household and participatory surveillance methods respectively, increases our confidence that our findings are robust. Our findings have implications for the evaluation of symptomatic testing criteria in the community, school settings, and high risk settings such as care homes or hospitals. The existence of phenotypes would suggest that the one-size-fits-all criteria used in the UK may be sub-optimal in these sub-populations where multiple phenotypes are plausible. 
Differences by ages could imply that symptomatic testing criteria should be tailored for different settings, though this would need to be balanced with what is feasible and understandable for the public. Studies that examine the optimal combination of symptoms to initiate testing of symptomatic community cases (Elliott et al., 2021; Fragaszy et al., 2021) may be implicitly assuming the existence of a single phenotype -to ensure that a symptom testing criteria is optimal, the possible existence of multiple phenotypes and the wide spectrum of disease must be considered. Several of the datasets in this study include only positive SARS-CoV-2 cases, a result of which is that we cannot evaluate the specificity of symptom testing criteria combinations informed by the symptom co-occurrence structures we have identified here, and this would limit our evaluation of symptomatic testing policies. Emphasis should be placed on the extent of symptom variation across COVID cases in communication with the public. This messaging is critical for the initiation of transmission control interventions including isolation, testing and contact tracing. Other studies have found that adding additional specific symptoms to the criteria for community symptomatic testing in the UK could potentially include a wider set of cases to be eligible (with implications for increasing testing demand) (Elliott et al., 2021; Fragaszy et al., 2021) . However, surveys have also found that a large proportion of the public is unaware of the existing symptom criteria (Smith et al., 2021) , so messaging focusing strongly on this variation could improve detection of cases and control of transmission, alongside a testing and isolation policy adapted to evolving epidemic circumstances (Crozier et al., 2021) . 
Further, it may be the case that the different characterisation of cases could inform clinical outcomes, for example the finding that cases can be described by the contribution of upper respiratory symptoms versus systemic or gastrointestinal symptoms to the total number of symptoms experienced. With increasing vaccination, re-infections and ongoing SARS-CoV-2 evolution, as well as the resurgence of other previously suppressed respiratory infections, understanding the variability of COVID-19 symptoms presentation is critical in planning community intervention for control of transmission, identification of cases potentially requiring greater care, and possibly understanding long term presentation of the disease (Antonelli et al., 2021) . Beyond even the current pandemic, the application of unsupervised learning analyses such as this one, in conjunction with clinical, epidemiological and behavioural understanding, is likely to yield important insights for other infectious diseases. Author contributions: All authors contributed to collection and processing of data, choice and implementation of analysis methods, and writing of the paper. Data and materials availability: Datasets are too sensitive for public release, and can be accessed by researchers through secure research environments. SAIL acknowledgement: This study makes use of anonymised data held in the Secure Anonymised Information Linkage (SAIL) Databank. ONS acknowledgement: This work contains statistical data from ONS which is Crown Copyright. The use of the ONS statistical data in this work does not imply the endorsement of the ONS in relation to the interpretation or analysis of the statistical data. This work uses research datasets which may not exactly reproduce National Statistics aggregates. We would like to acknowledge all the data providers who make anonymised data available for research. 
Code and datasets that have been approved for publication are available at: https://github.com/martyn1fyles/CovidSymptomsAnalysisPublic .
Fig. 3. AlignedUMAP embeddings of SARS-CoV-2 symptoms. For each dataset, an optimal embedding of the symptoms into 2D Euclidean space is found, subject to the following loose constraint: if a symptom is common to all datasets, then it should be placed in roughly the same position across all datasets. This alignment allows for easier comparison, and investigation of shared symptom structures across all datasets. Point size is proportional to the proportion of cases that develop a given symptom. Symptoms that are common to all datasets, and are aligned between distinct datasets, are plotted as triangles. For this embedding the parameters were chosen to capture more of the global structure of symptoms, which produces less well-defined clusters. a. Pillar 2, b. SGSS, c. COVID Symptom Study, d. COVID-19 Infection Survey.
Fig. 4. AlignedUMAP embeddings of SARS-CoV-2 symptoms across several datasets. Each dataset has been age-stratified into strata of length 10 years. For each stratum, an optimal two-dimensional embedding into Euclidean space of the symptoms is found, subject to the loose constraint that each symptom is placed in a similar location in adjacent embeddings. Linear interpolation is used to connect the embedding of each stratum, allowing for a 3-dimensional visualisation of how the co-occurrence patterns of symptoms change with age. a. Pillar 2, b. SGSS, c. COVID Symptom Study, d. COVID-19 Infection Survey. Subplots denoted by 2 are a 90 degree clockwise rotation of subplots denoted with a 1.
This is a secondary data analysis of SARS-CoV-2 PCR-positive cases from the general community in the United Kingdom between May 2020 and March 2021. 
We extract data about positive symptomatic cases from four diverse datasets: two datasets from the NHS Test and Trace routine testing and contact tracing programme; the Office for National Statistics COVID-19 Infection Survey (CIS), a population-representative survey of randomly selected households; and the COVID Symptom Study (CSS), a participatory surveillance mobile app project. A.1.1 NHS Test and Trace routine testing data NHS Test and Trace data is further split into two parts: Pillar 2, cases detected in the community, usually on the basis of symptoms to initiate testing; and the Second Generation Surveillance System (SGSS), for people tested in healthcare settings. In May 2020, the UK government made PCR testing available for individuals who had one of the following symptoms: a new, continuous cough; fever; loss of taste or loss of smell. These tests are reported through Pillar 2, through which several different avenues to testing are available. Individuals can book a test appointment through a government website for either a drive-in or walk-through testing centre, where they self-swab their nose and throat (under some supervision, with an adult carer conducting the swabbing for children), with the swab then sent to a lab for PCR testing. Alternatively, individuals can order home test kits where they self-swab at home and post the kit back, with the swab again sent to a lab for PCR testing. If the individual tests positive, their case is transferred to NHS Test and Trace who contact cases to inform them of their result and ask them to conduct a questionnaire including symptoms experienced. The questionnaire is conducted either via a web-form or over the phone with a trained contact tracer. Since the end of 2020, Pillar 2 has also included positive cases identified using rapid antigen tests among people not experiencing one of the PCR test prompting symptoms. 
These tests also use a nasopharyngeal swab and are conducted at home, in the workplace or at school; individuals who test positive are asked to follow up with a confirmatory PCR test (though policy has varied over time). Reported positive cases from asymptomatic testing are also followed up by NHS Test and Trace. The Second Generation Surveillance System (SGSS) dataset includes people who are tested because they work in a healthcare setting or have been tested there as a patient. This latter group includes both those in hospital because of severe COVID-19 symptoms and those in hospital for other reasons who receive SARS-CoV-2 testing. Thus cases in the SGSS data are likely to be more severe on average than those in the Pillar 2 data, but not exclusively so. Again, individuals are swabbed and PCR tested and, if they test positive, their case is transferred to NHS Test and Trace for symptom reporting and contact tracing. The CSS is a participatory surveillance study collecting data via a smartphone app. It is led by King's College London and Zoe Global Ltd and was initiated in March 2020 in the United Kingdom and the United States (Drew et al., 2020). Individuals are asked to report daily whether they are feeling 'physically normal' that day and, if not, what symptoms they are experiencing. Demographic data and data about underlying conditions are collected at first registration. Participants are also asked to self-report whether they have had any tests for SARS-CoV-2 infection and, if so, the date of the test and its result. Participants can also proxy-report for children or for others they care for (e.g. elderly adults). 
As well as enabling individuals to self-report COVID-19 testing that they have undertaken via the UK's routine testing programmes or surveillance studies, the CSS invites individuals to complete a PCR test via routine testing if they 1) have made at least one report of no symptoms in the previous week and 2) report a new symptom not on the list to prompt symptomatic testing (e.g sore throat). This means that we might expect the CSS reporting to be less dominated by the symptoms required to initiate symptom-based testing than the Pillar 2 routine testing dataset. The CIS is a UK population-representative survey of households randomly selected continuously since April 2020 from address lists and previous surveys (Pouwels et al., 2021) . Households are followed longitudinally with weekly visits for the first month and monthly visits for 12 months from enrolment. A fieldworker attends enrolled households each visit for testing for household members aged 2 years and above and to conduct an interview including, among other topics, demographic data (reported at first visit) and symptoms experienced over the previous 7 days. At each visit, participants conduct a nose and throat swab under supervision of a fieldworker. These swabs are sent for PCR testing and the result communicated to participants. At the same time as swabbing, all participants are also interviewed by the fieldworker to complete a symptom questionnaire. The dates over which cases are collected from each study are shown in Table S1 and in total cover the period from April 2020 to March 2021, with the largest overlap between November 2020 and January 2021. We make no exclusions based on age or other characteristics. From each dataset, we include only cases who report at least one symptom within the symptom reporting window around a positive test (detailed below and listed in Table S2 ). 
For NHS Test and Trace data, positive cases who are never reached and interviewed post-testing are not included in this dataset. The definition of 'symptomatic' necessarily varies across the datasets because there are differences in the full list of symptoms asked about. Symptoms that were not core dataset variables and were instead recorded by manual entry were not included. For each dataset, we chose to include all dataset symptoms from each study (except for 'write-in' symptoms), rather than excluding symptoms that were not common across all. This was with the intention of maximising the amount of symptom information available for analysis. We also extracted demographic information. It is expected that symptom data from the same PCR-positive cases is captured across the NHS Test and Trace, CIS and CSS datasets. Explicit deduplication of individuals across datasets was not performed but is expected to have no impact on the findings. The proportion of symptomatic cases varies significantly between datasets, reflecting their different sampling. NHS Test and Trace Pillar 2 and CSS both have the highest number of symptomatic cases, which is not surprising given that both datasets mainly focus on symptom initiated testing. The NHS Test and Trace SGSS dataset has the next highest proportion of symptomatic cases. We expect to see some asymptomatic screening in SGSS populations, which may explain the decrease in symptomatic cases when comparing Pillar 2 to SGSS. In the CIS study, we see a much smaller proportion of symptomatic cases, likely due to the sampling strategy being independent of symptoms, therefore resulting in asymptomatic and pre-symptomatic individuals testing positive and being included in the study. The Pillar 2 routine testing contributed by far the largest number of cases to the study, with CIS the fewest. 
While all datasets contained a slight female majority, with CSS the largest (61.8%), there was some variability in the age distribution of cases (Figure S1); Pillar 2 routine testing was the youngest, while SGSS included the oldest groups. This is likely to be because SGSS more heavily represents a hospitalised population. CSS and CIS are UK-wide, while NHS Test and Trace data contains cases tested in England. The characteristics of the infected sub-population relative to the general UK population have likely changed over time for a multitude of reasons: different levels of restrictions and lockdown across different localities, vaccination coverage and uptake, varying prevalence, weather, levels of outdoors mixing, incentives to ignore social distancing, workplace/school closures and changing availability of testing. Moreover, each study and route of data collection results in different samples of the infected population. The CIS is a population-based household sample and thus should be broadly representative (participation biases aside), whereas cases discovered through routine testing (NHS Test & Trace) may over-represent a population adherent to testing guidance, those prone to more severe infections and the sub-populations with the highest prevalence and testing-seeking behaviour. Given CSS's app-based reporting, the sub-populations with high levels of smartphone ownership and compliance are likely to be over-represented.
Table S1: Descriptive statistics of the population in each dataset.
Figure S1. Histograms showing the age density for each dataset.
Data is collected at the level of the symptoms experienced by an individual, and for the majority of datasets we have a binary outcome of whether an individual experienced a symptom or not. Exact symptom questions and lists are given in Table S2. In CSS an individual is able to choose from several levels of fatigue: "none", "mild" or "severe". 
Our planned analyses are designed to work with binary data, and as a result we map multiple levels into a binary outcome variable. When performing this mapping, we choose to merge levels together, with the aim of making the symptoms as comparable as possible to what is reported in other datasets. Datasets with a binary fatigue variable report 40-60% of cases, which is consistent with most cases only reporting severe fatigue; if we included mild fatigue then we find that close to 80% of cases report fatigue which is inconsistent with what is reported in the other datasets. The symptom reporting window and its timing relative to the positive test varies across the datasets. For CIS, participants are asked about symptoms in the previous 7 days prior to testing. For cases contacted and interviewed by NHS Test and Trace (Pillar 2 and SGSS), individuals are asked to report symptoms that they are currently experiencing. For CSS, individuals are prompted to report symptoms daily but for this dataset we include all symptoms reported in the 14 days before and 14 days after the date a positive test is reported (note this does not mean that all participants report symptoms with that level of frequency). From the time of infection, individuals usually have a few days before they become symptomatic, while test sensitivity also varies over the course of infection, peaking around the time or just before symptom onset. Previous studies have also found patterns in the types of symptoms that present earlier versus later in the course of an infection (Sudre et al., 2021) . Across each of the included datasets, the time in an individual's infection at which they are tested on average and over which they are asked to report symptoms varies. 
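The window-based extraction and the mapping of multi-level responses to a binary outcome can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the report format and the example dates are hypothetical, and coding only "severe" fatigue as present is an assumption suggested by the comparison above rather than a documented rule.

```python
from datetime import date, timedelta

# Assumption: only "severe" fatigue is coded as present in the binary variable.
FATIGUE_TO_BINARY = {"none": 0, "mild": 0, "severe": 1}


def symptoms_in_window(daily_reports, test_date, days_before=14, days_after=14):
    """Collect every symptom reported within the window around a positive test.

    daily_reports is a list of (report_date, set_of_symptoms) pairs.
    The defaults correspond to the 14 days before and after used for CSS.
    """
    start = test_date - timedelta(days=days_before)
    end = test_date + timedelta(days=days_after)
    present = set()
    for report_date, reported in daily_reports:
        if start <= report_date <= end:
            present.update(reported)
    return present


# Hypothetical reports for one participant (illustrative only).
reports = [
    (date(2020, 11, 1), {"headache"}),      # more than 14 days before the test
    (date(2020, 11, 20), {"cough"}),
    (date(2020, 11, 25), {"cough", "fever"}),
    (date(2020, 12, 20), {"sore throat"}),  # more than 14 days after the test
]
window_symptoms = symptoms_in_window(reports, test_date=date(2020, 11, 24))
fatigue_flags = [FATIGUE_TO_BINARY[level] for level in ["none", "mild", "severe"]]
```

A window of 7 days before and 0 days after would correspond to the CIS question about symptoms in the 7 days prior to testing.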
For CIS, the time of testing over the course of an infection should be random over the period in which someone will test PCR-positive; for data primarily from symptomatic testing it should be a few days post symptom onset (reflecting a delay between onset and testing, test result and follow-up interview with Test and Trace). For CSS, the time of testing for many will reflect symptomatic testing in the community, and some proportion of individuals with particular symptom reporting patterns are asked to obtain a test through NHS Test and Trace symptomatic testing routes.

[Fragment of the symptom question table; only these rows survive: Sore throat / Sore throat / Do you have a sore throat?; Vomiting / Vomiting / --]

To aid interpretation we classify symptoms according to their clinical characteristics. These classifications were made a priori in consultation with an infectious diseases clinician (TW) with experience of caring for people with COVID-19 and without input from observed clustering patterns. We included systemic symptoms, lower respiratory, upper respiratory, gastrointestinal, altered state symptoms and 'other' symptoms that did not fit into any of these categories.

The secondary analyses described in this paper received ethical approval from the London School of Hygiene and Tropical Medicine (22752).

We describe the frequency with which each symptom was reported in each dataset, categorising them using our symptom classification. We then perform three unsupervised learning techniques, each with a different but complementary aim. Our goal is to understand patterns of symptom co-occurrence and whether there is any evidence of symptom clustering, as multiple distinct clusters would be evidence for the existence of distinct COVID-19 symptom phenotypes.

We use a variety of methods to understand the behaviour of symptoms, and the analyses are sometimes performed on the Jaccard distance matrix of symptoms. The Jaccard distance is defined as

$$d_J(x_i, x_j) = 1 - \frac{|x_i \cap x_j|}{|x_i \cup x_j|},$$

where $x_i$ is the feature vector constructed from the presence or absence of symptom $i$ in cases, $|x_i \cap x_j|$ is the number of cases reporting both symptoms and $|x_i \cup x_j|$ is the number of cases reporting at least one of them.
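To make the definition concrete, the following minimal Python sketch computes the Jaccard distance between two binary symptom vectors in its standard form: one minus the number of cases with both symptoms divided by the number with at least one. The function name, the use of `None` to mark a missing response (such cases are skipped for that pair), the example vectors, and the convention of returning 1 when neither symptom is ever reported are assumptions made for illustration; they are not taken from the analysis code.

```python
from typing import Optional, Sequence


def jaccard_distance(xi: Sequence[Optional[int]],
                     xj: Sequence[Optional[int]]) -> float:
    """Jaccard distance between two binary symptom vectors (one entry per case).

    Entries are 1 (symptom present), 0 (absent) or None (missing). A case with
    a missing value for either symptom is skipped for this pair.
    """
    both = 0    # cases reporting symptom i AND symptom j
    either = 0  # cases reporting symptom i OR symptom j
    for a, b in zip(xi, xj):
        if a is None or b is None:
            continue
        if a == 1 and b == 1:
            both += 1
        if a == 1 or b == 1:
            either += 1
    if either == 0:
        # Illustrative convention: neither symptom observed, so treat the
        # pair as maximally distant rather than dividing by zero.
        return 1.0
    return 1.0 - both / either


# Hypothetical reports from six cases (None = question not answered).
cough = [1, 1, 0, 1, 0, None]
fever = [1, 0, 0, 1, 1, 1]
d_cough_fever = jaccard_distance(cough, fever)
```

Here two of the four cases with at least one of the two symptoms report both, so the distance is 0.5; a symptom compared with itself has distance 0.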
The Jaccard distance then has a simple interpretation: it is one minus the proportion of cases who experienced both symptoms i and j, among those who experienced at least one of symptoms i or j. In the case of missing data, the Jaccard distance is computed using only the subset of individuals for whom neither symptom i nor symptom j is missing.

Hierarchical clustering starts with a set of symptoms, and the feature vector for each is constructed from their presence or absence in individuals with a positive test and report of at least one symptom (i.e. those positive cases not excluded as asymptomatic). The Jaccard distance is used as an appropriate metric for such binary data. Clusters of symptoms are agglomeratively joined on the dendrogram on the basis of the maximum distance between cluster members (called 'complete linkage'). Symptoms separated by a short distance on the final dendrogram tend to co-occur, and those separated by a long distance are seldom both present. Clusters can also be identified by 'cutting' the dendrogram at a given distance.

Logistic PCA is an extension of principal component analysis (PCA) to binary data, and reduces the dimension of the symptom space in a manner that preserves the maximum level of variance between individuals (rather than symptoms) (Landgraf and Lee, 2020). The projection values of symptoms onto the lower-dimensional basis are called loadings, and these demonstrate the directions in which individual phenotypes most commonly vary. In practice, the first component is likely to have relatively even contributions from each of the symptoms, and will represent an overall severity of illness at the individual level, with subsequent components demonstrating more subtle ways in which symptoms can vary. Given an $n \times d$ data matrix $X = [x_{ij}]$, our aim is to find a low-dimensional representation of the natural parameter matrix $\Theta = [\theta_{ij}]$, where $\theta_{ij} = \operatorname{logit} P(x_{ij} = 1)$.
This is achieved by finding $\hat{\Theta}_k$, a rank $k$ approximation to $\Theta$ such that the Bernoulli deviance $D(X; \hat{\Theta}_k)$ is minimised. This is conceptually related to logistic regression models, as these also attempt to minimise the Bernoulli deviance. In practice, following Landgraf and Lee (2020), the minimisation is solved over $U \in \mathbb{R}^{d \times k}$ such that $U^T U = I_k$, with $\hat{\Theta}_k = 1_n \mu^T + (\tilde{\Theta} - 1_n \mu^T) U U^T$, where $\tilde{\Theta}$ denotes the natural parameters of the saturated model. The column vectors of $U$ are the loadings of the principal components.

As with all dimensionality reduction techniques, we need to choose the number of dimensions in our low-dimensional approximation. We follow the recommendation of Landgraf and Lee (2020), and examine the change in the Bernoulli deviance as we increase $k$. Consider a rank 0 approximation, where $\hat{\Theta}_0 = 1_n \mu^T$ for $\mu \in \mathbb{R}^d$. That is to say that the natural parameter matrix contains a constant value in every column. This is treated as the null model, against which all other models are compared. For a model with $k$ components, the proportion of Bernoulli deviance explained relative to the null model is given by

$$P(k) = 1 - \frac{D(X; \hat{\Theta}_k)}{D(X; \hat{\Theta}_0)}.$$

If $k = d$, then $D(X; \hat{\Theta}_k) = 0$, as the model is saturated and the fitted probabilities reproduce $x_{ij}$ exactly, resulting in $P(d) = 1$. This means that $P(k)$ can be interpreted similarly to the proportion of variance explained in standard PCA: $100P(k)\%$ of the Bernoulli deviance is explained by the first $k$ components. The marginal Bernoulli deviance, $M(k)$, is the change in the proportion of Bernoulli deviance explained by adding the $k$th component, defined for $k \geq 1$ as

$$M(k) = P(k) - P(k-1).$$

When selecting the number of components in our low-dimensional representation of the data, we primarily focus upon the marginal Bernoulli deviance, and aim to find the largest $\hat{k}$ such that the marginal Bernoulli deviance drops sharply for $k > \hat{k}$. We also examine the proportion of Bernoulli deviance explained: if this gets close to 1, that suggests we have selected too many components and are over-fitting.
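As a small numerical illustration of these quantities, the sketch below evaluates the Bernoulli deviance in its usual form, $-2\sum_{ij}[x_{ij}\theta_{ij} - \log(1 + e^{\theta_{ij}})]$, builds the null model from column-wise log-odds, and converts a sequence of deviances into the proportion explained and the marginal contribution of each component. It does not fit logistic PCA itself. All function names, the toy data matrix and the list of deviances are hypothetical and chosen only for illustration.

```python
import math
from typing import List

Matrix = List[List[float]]


def bernoulli_deviance(X: Matrix, Theta: Matrix) -> float:
    """Bernoulli deviance of natural parameters Theta for binary data X."""
    loglik = 0.0
    for x_row, t_row in zip(X, Theta):
        for x, t in zip(x_row, t_row):
            # log(1 + exp(t)), written in a numerically stable way.
            softplus = max(t, 0.0) + math.log1p(math.exp(-abs(t)))
            loglik += x * t - softplus
    return -2.0 * loglik


def null_model(X: Matrix) -> Matrix:
    """Rank 0 model: every column holds one constant, the column log-odds.

    Assumes each symptom is reported by some but not all cases.
    """
    n, d = len(X), len(X[0])
    mu = []
    for j in range(d):
        p = sum(row[j] for row in X) / n
        mu.append(math.log(p / (1.0 - p)))
    return [list(mu) for _ in range(n)]


def proportion_explained(deviances: List[float]) -> List[float]:
    """deviances[k] is the deviance with k components; returns P(0..K)."""
    return [1.0 - d_k / deviances[0] for d_k in deviances]


def marginal_explained(P: List[float]) -> List[float]:
    """Returns M(1..K), the gain in P from adding each component."""
    return [P[k] - P[k - 1] for k in range(1, len(P))]


# Toy data: four cases (rows) by two symptoms (columns).
X = [[1, 0], [1, 1], [0, 1], [1, 0]]
D0 = bernoulli_deviance(X, null_model(X))

# Large finite natural parameters stand in for the saturated model here.
saturated = [[30.0 * (2 * x - 1) for x in row] for row in X]
D_sat = bernoulli_deviance(X, saturated)

# Hypothetical deviances for k = 0, 1, 2, 3 components.
P = proportion_explained([10.0, 6.0, 4.0, 3.5])
M = marginal_explained(P)
```

With these made-up deviances, `P` is approximately `[0, 0.4, 0.6, 0.65]` and `M` is approximately `[0.4, 0.2, 0.05]`, so the marginal gain falls away after the second component, which is the kind of pattern the selection rule looks for.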
In practice, there are in fact two hyperparameters that need to be chosen for logistic PCA: the number of components $k \in \mathbb{N}$, and $m \in \mathbb{N}$, which controls the magnitude of the loadings. The optimal choice of $m$ varies depending upon $k$, and is selected by leave-one-out cross validation for a range of proposed $m$ values. An example of the model selection is plotted in Figure S2. In the plotted example, we would choose $\hat{k} = 2$, indicated by the vertical dashed line. This is because the first two components have a significantly higher marginal Bernoulli deviance than the components added in all models with $k > 2$. The marginal Bernoulli deviances for models where $k \in \{3, \ldots, 8\}$ differ little between successive values of $k$, making it hard to favour one model over another. For $k > 8$, the marginal Bernoulli deviance does decrease rapidly; however, by this point we have explained close to 100% of the Bernoulli deviance and are over-fitting the model. Hence, for this example we choose $\hat{k} = 2$. If we select $k = 2$ components, then approximately 33% of the Bernoulli deviance is explained relative to the null model.

In classic PCA, ideally the model would find a number of components that explains as close to 100% of the variance as possible while not over-fitting to noise. In logistic PCA, however, if a model explains close to 100% of the Bernoulli deviance relative to the null model, then this is indicative of dramatic over-fitting. For example, the saturated model where $k = d$ will exactly reproduce the input data and explains 100% of the Bernoulli deviance. The true natural parameter matrix will not explain 100% of the Bernoulli deviance, because it encodes only the probability of each symptom occurring, which leaves a non-zero Bernoulli deviance. As such, our goal is not to explain 100% of the Bernoulli deviance relative to the null model.
Model selection plots for the number of components $\hat{k}$ can be found in the code repository for this paper (https://github.com/martyn1fyles/CovidSymptomsAnalysisPublic). For some plots, the model selection process would suggest that the optimal number of components is $\hat{k} = 1$. During the model selection process, we find that $\hat{k} > 2$ is rejected. Throughout this paper we have opted to present LPCA results where we take $\hat{k} = 2$: given that several datasets show clear and strong signals that $\hat{k} = 2$, we reason that it is likely that the true number of components is $k = 2$ across all datasets. However, we acknowledge uncertainty in the model selection, and make available in the code repository LPCA plots where we have taken $\hat{k} = 1$. Unlike in traditional PCA, LPCA components depend upon the total number of components selected, so that PC1, for example, differs depending upon the total number of components. This is why we must rerun the analysis when we take $\hat{k} = 1$, and present these results separately from results produced where we set $\hat{k} = 2$.

Figure S2. a, the proportion of the Bernoulli deviance explained using an LPCA model with k components. b, the proportion of the Bernoulli deviance explained by adding the kth component to the model. In this example, we would select k = 2 as the true number of components, as indicated by the vertical dashed red line.

UMAP (Uniform Manifold Approximation and Projection) is a technique for dimension reduction of complex data based on pairwise distances between symptoms. In contrast to the other methods, it is designed to achieve good separation between unknown classes in the low-dimensional space, and as such complements the other machine learning methods used above. We compute the Jaccard distance matrix for the observed symptoms, and configure UMAP to attempt to place the symptoms in a 2-dimensional Euclidean space such that symptoms with a smaller Jaccard distance between them, which are considered more similar, are placed closer together.
To enable comparison across the datasets and across age strata, we present results using the AlignedUMAP algorithm, where we align the embeddings between: 1) different datasets, aligning core symptoms common across datasets; and 2) for each dataset, across ten-year age strata. As AlignedUMAP necessitates some trade-off between finding the optimal embedding and performing the alignment, we complement the analysis by also producing non-aligned embeddings of the datasets.

Table S3: Sample sizes for each strata of the AlignedUMAP embeddings in the main paper.

There are a large number of UMAP hyperparameters that can be adjusted, and finding the optimal combination is not a solved problem to our knowledge. We have opted to produce two UMAP outputs for each dataset, one configured to produce a tight clustering of symptoms, and another configured to produce a loose clustering of symptoms. This is achieved by changing the number of neighbouring points that UMAP considers when it is placing points, similar to the k-nearest neighbours algorithm. Smaller values of the n_neighbors parameter configure UMAP to focus on local structures, so it may not capture the global structure; this produces what we refer to as a tight clustering, with well separated clusters. Setting the n_neighbors parameter to higher values configures UMAP to focus less on the local structures of the symptoms and to produce a more general clustering of the data; this produces what we refer to as a loose clustering. Loose and tight UMAP embeddings capture different parts of the symptom topology, and produce complementary analyses. When we produce the loose UMAP embeddings that focus more upon the global structure of the data, we take n_neighbors = 4, and for the tight UMAP embeddings that focus more upon the local structure of the data, we take n_neighbors = 2.
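The two configurations described above could be expressed roughly as follows with the umap-learn package, whose constructor accepts `n_neighbors`, `n_components` and `metric="precomputed"` for a ready-made distance matrix. This is a sketch under stated assumptions, not the code used for the paper: the dictionary layout, the helper name `embed_symptoms`, the fixed `random_state` and the choice of `init="random"` (made here because the number of symptoms is small) are all illustrative. The package is imported lazily so that the settings can be inspected without it being installed.

```python
from typing import Dict, Sequence

# n_neighbors values as quoted in the text; everything else is illustrative.
UMAP_SETTINGS: Dict[str, Dict[str, object]] = {
    "tight": {"n_neighbors": 2, "n_components": 2, "metric": "precomputed"},
    "loose": {"n_neighbors": 4, "n_components": 2, "metric": "precomputed"},
}


def embed_symptoms(distance_matrix: Sequence[Sequence[float]],
                   style: str = "tight",
                   random_state: int = 0):
    """Embed symptoms in 2D from a symmetric Jaccard distance matrix.

    Requires the third-party packages numpy and umap-learn.
    """
    import numpy as np
    import umap  # provided by the umap-learn package

    reducer = umap.UMAP(
        init="random",              # assumption: avoids spectral init on tiny inputs
        random_state=random_state,  # assumption: fixed seed for repeatability
        **UMAP_SETTINGS[style],
    )
    return reducer.fit_transform(np.asarray(distance_matrix, dtype=float))
```

Calling `embed_symptoms(D, "loose")` on a symptom-by-symptom distance matrix `D` would return one 2D coordinate per symptom; since UMAP is stochastic, repeated runs with different seeds are a sensible check on any cluster that appears.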
When using AlignedUMAP for the fine age strata, we align each slice of data with the two slices before and the two slices after the current slice. Aligning with fewer slices produced plots with less smoothing, and aligning with more slices obscured signals in the data (e.g. under-10s might be aligned with those much older than they are, who might have significantly different symptom occurrence patterns). For the broad age strata, all three strata were aligned. When performing alignment, there is a regularisation parameter that controls the 'strength' of the alignment. To select the value of the regularisation parameter, we started with values that produced little to no alignment, and increased the strength of the alignment until visual inspection suggested that datasets were aligned without being over-aligned. We also explored varying the min_dist parameter, which controls the minimum distance between points in the produced embeddings (an assumption of the algorithm is that two points cannot have 0 distance between them). Overall, we find that this parameter does not change the structure of embeddings, and largely produces visual changes that are useful when producing plots without significant overlap of points.

We first run hierarchical clustering, LPCA and AlignedUMAP for all the included cases from each dataset, and then repeat them stratified by age group (0-17 years, 18-54 years and 55 years and older). With only three age strata, it is not necessary to plot the results in 3D space, as was required for the results from the finer age strata shown in Fig. 4. This produces a complementary analysis to the results from the main paper.
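For readers unfamiliar with AlignedUMAP, the sketch below shows one way the alignment between consecutive age slices might be set up with umap-learn, which takes a list of datasets plus a list of `relations` dictionaries linking row indices in one slice to row indices in the next. It assumes every slice contains the same symptoms in the same order, that `alignment_window_size=2` corresponds to aligning with two slices either side, and that precomputed distance matrices are accepted by the installed version; none of this is confirmed by the source. The helper names and the default regularisation value are hypothetical.

```python
from typing import Dict, List, Sequence


def identity_relations(n_symptoms: int, n_slices: int) -> List[Dict[int, int]]:
    """One mapping per consecutive pair of slices: symptom i maps to symptom i."""
    return [{i: i for i in range(n_symptoms)} for _ in range(n_slices - 1)]


def fit_aligned(slices: Sequence, alignment_regularisation: float = 0.01,
                n_neighbors: int = 2):
    """Fit AlignedUMAP to a list of per-stratum distance matrices.

    Requires umap-learn. The regularisation default is a placeholder: the text
    describes tuning this value by visual inspection.
    """
    import umap  # provided by the umap-learn package

    relations = identity_relations(len(slices[0]), len(slices))
    mapper = umap.AlignedUMAP(
        metric="precomputed",      # assumption: distance matrices as input
        n_neighbors=n_neighbors,
        alignment_window_size=2,   # assumption: two slices either side
        alignment_regularisation=alignment_regularisation,
    )
    mapper.fit(list(slices), relations=relations)
    return mapper.embeddings_      # one embedding per slice


# Example: three symptoms tracked across four age slices.
relations = identity_relations(n_symptoms=3, n_slices=4)
```

Increasing `alignment_regularisation` pulls the same symptom towards the same position in neighbouring slices, which is the 'strength' of alignment discussed above.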
Because of the different sampling of positive cases and resulting sample composition, data collection methods, and symptom questions across the datasets, we expect potential differences in findings arising from:

Sampling: The majority of routinely detected community cases in the UK were detected via symptom-prompted tests, particularly prior to the widescale availability of rapid antigen testing for asymptomatic individuals in the Spring of 2021. Thus we expect Pillar 2 to over-represent individuals with at least one of cough, fever and loss of taste and/or smell. This bias is also likely to exist within the CSS, as a majority of self-reported tests would also have been performed because the individual met the symptom criteria for routine community testing, though the study also invited a proportion of regular app-user participants to test based on reporting other symptoms. These biases are not present within the ONS study sample.

Data collection method: Across all datasets, symptoms are assessed via self-report, including fever. The experience of symptoms and their description is likely to vary across individuals and across demographic characteristics, such as gender, ethnicity, region and age. People are likely to report symptoms differently depending on whether they are doing so via an in-person interview, a weekly or bi-weekly survey or a daily symptom tracking app, and the design of the app or questionnaire interface, as well as the preceding questions, will likely affect reporting. The majority of studies examining the efficacy of symptom self-report have focused on psychiatric disorders. These have generally found agreement between patient self-report and clinician assessment, although this varies from 60% to 90% (Lyu et al.; Silverstein et al.; Chan et al.). In major depressive disorder, self-reported symptoms are more severe than clinician-assessed symptoms (Lyu et al.).
When self-tracking for health and fitness purposes, BMI is systematically under-reported (Wilson et al.) . Knowledge of test status could also affect symptom reporting, though this will be less of an issue in the CIS dataset, where individuals will not yet have received their test result. Some studies involve reporting on behalf of others, particularly children or adults receiving care, and communicating the subjective experience of symptoms might be challenging in these cases. When reporting symptoms related to cancer treatment, a dyad (parent and child) approach to reporting symptoms was found to be more effective and preferable to child self-reporting or parent proxy reporting alone (Tomlinson et al.) . The symptom reporting window around positive test time varies across the different datasets. There is evidence from previous studies (Sudre et al., 2021) that some symptoms tend to appear earlier in infection while some appear later. We also know that people who test negative, who are not included in this dataset, report a wide range of symptoms that are not related to SARS-CoV-2 infection (Elliott et al., 2021) ; widening the symptom reporting window around a test date might include symptoms that are non-specific to the SARS-CoV-2 infection. Our approach collapses across time and these variations in the reporting window could affect our findings regarding symptom frequency and clustering. While there is not a way of varying this for the routinely collected NHS Test and Trace data, we do conduct sensitivity analyses to examine a wider symptom reporting window around the day of testing for the ONS dataset, making it more comparable to CSS. We arbitrarily define positive episodes as a new positive occurring more than 90 days after an index positive or after 4 consecutive negative tests, and consider symptoms reported in [-7,+35] days around the index positive. We do not find that this wider symptom window affects our clustering and co-occurrence findings. 
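The episode definition and symptom window used in this sensitivity analysis can be written down compactly. The sketch below is one possible reading of the rule, in which a positive test opens a new episode if it is the first positive, falls more than 90 days after the current index positive, or follows at least four consecutive negative tests; symptoms are then kept if reported between 7 days before and 35 days after the index positive. The function names, the tuple representation of a test and the example dates are hypothetical.

```python
from datetime import date
from typing import List, Tuple

TestResult = Tuple[date, bool]  # (date of test, PCR positive?)


def index_positives(tests: List[TestResult],
                    gap_days: int = 90,
                    n_negatives: int = 4) -> List[date]:
    """Return the index-positive date of each positive episode."""
    starts: List[date] = []
    negatives_in_a_row = 0
    for day, positive in sorted(tests):
        if not positive:
            negatives_in_a_row += 1
            continue
        new_episode = (
            not starts
            or (day - starts[-1]).days > gap_days
            or negatives_in_a_row >= n_negatives
        )
        if new_episode:
            starts.append(day)
        negatives_in_a_row = 0
    return starts


def in_window(symptom_day: date, index_day: date,
              before: int = 7, after: int = 35) -> bool:
    """True if the symptom report falls within [-before, +after] days."""
    return -before <= (symptom_day - index_day).days <= after


# Hypothetical testing history for one participant.
history = [
    (date(2020, 10, 1), True), (date(2020, 10, 8), True),
    (date(2020, 10, 15), False), (date(2020, 10, 22), False),
    (date(2020, 10, 29), False), (date(2020, 11, 5), False),
    (date(2020, 11, 12), True),   # follows four consecutive negatives
    (date(2021, 3, 1), True),     # more than 90 days after the previous index
]
episodes = index_positives(history)
```

In this example the participant has three episodes, starting on 1 October 2020, 12 November 2020 and 1 March 2021; the positive on 8 October belongs to the first episode.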
Epidemic phase: The characteristics of cases differ over the course of the epidemic, for example by age, region, socioeconomic characteristics or variant of SARS-CoV-2 infection, which in turn could plausibly affect the symptoms experienced and the likelihood that they are reported. Some positive cases could be from single or double-vaccinated individuals, particularly from later time periods in Winter/Spring 2021. Similarly to our AlignedUMAP embeddings for age-stratified data, we could also produce AlignedUMAP embeddings for time-stratified data, allowing us to investigate how symptom co-occurrence patterns change over time. This would be of particular interest as vaccination effects build, or as a new variant with a different disease profile becomes dominant. The requirement of such an analysis is that each time stratum has a sufficient number of points that the estimated Jaccard distance matrix is not subject to significant uncertainty. An initial exploration of this analysis was performed for the Pillar 2 and SGSS datasets by stratifying into week-long strata; however, no significant changes to the symptom co-occurrence patterns were observed during this time period.

All datasets include only cases reporting at least one symptom for these analyses. The most commonly reported symptom across all datasets was headache, with approximately half of cases in the Pillar 2, SGSS and CIS datasets reporting it, and almost two thirds of those from CSS (Fig. 3). The frequency of systemic symptom reports is high across the datasets. Fever, a systemic symptom intended to prompt isolation and testing in the UK, was experienced by less than one third of all symptomatic cases. Cough, another isolation and testing initiating symptom (when new and continuous, which was not captured in these datasets), was also common (39% to 59%).
NHS Test and Trace did not include any other lower respiratory tract symptoms, but shortness of breath was experienced by 24% of cases in CIS and by 5% in CSS, while 26% of symptomatic cases participating in CSS reported chest pain (not collected in other datasets). Each dataset includes information about altered/loss of smell and/or taste but collected this differently, though all variations were commonly reported. Altered/loss of smell was most frequently reported in the CSS (52%), while loss of taste and smell separately (CIS) and in combination (NHS Test & Trace) was reported by over 30%. These symptoms also trigger isolation and testing. Sore throat was a common upper respiratory symptom in all datasets (30% to 42%). Sneezing and rhinitis, only collected by Test and Trace, were reported by around one quarter of symptomatic cases. Gastrointestinal symptoms tended to be less frequent than systemic and respiratory symptoms, but were not unusual (mainly reported by 10-20%, though less frequently for vomiting alone), with the exception of loss of appetite, which was reported by between one quarter and one third of cases in Test and Trace and CSS.

Figure S3. A plot containing the proportion of cases that develop a symptom across datasets. Each dataset records a different set of symptoms, and in some datasets multiple symptoms are considered to be one variable.

Looking at Fig. S12, we see a global structure similar to what we observe in the main paper using the AlignedUMAP algorithm. The embeddings of most datasets can be described by a central cluster of systemic and lower respiratory tract symptoms. Upper respiratory tract symptoms, such as rhinitis, sneezing, hoarse voice and sore throat, are typically placed close to the systemic symptoms cluster, with the exception of loss of smell and taste symptoms. Gastrointestinal symptoms are often placed further away from the upper respiratory tract symptoms, and often form a tail leading to some of the rarer symptoms.
We note that these embeddings synthesise the results we observed from the LPCA loadings, where the second loading suggested that cases could be separated based upon whether they predominantly experienced upper respiratory tract symptoms, or systemic and gastrointestinal symptoms. The relatively low rates of occurrence of gastrointestinal symptoms explain their appearance high in the hierarchical tree, while the higher frequency of systemic and respiratory symptoms explains their relative importance in LPCA loadings, within the general structure revealed by UMAP.

We repeat the UMAP analysis without alignment, this time with the algorithm tuned to focus more on the local structure of the data and less on the global structure of the data. This produces better separation of the symptoms into clusters in the low-dimensional embeddings; however, some of the relationships between these clusters may be lost. In the resulting embeddings, we observe several pairs of symptoms that commonly co-occur but appear to be distinct from the main body of other symptoms, notably sneezing and rhinitis in the Pillar 2 and SGSS datasets, headache and sore throat in the CSS dataset, and loss of smell and taste in the CIS dataset. The remaining symptoms are often packed into two tight clusters. For Pillar 2 and SGSS, a clear separation is observed between systemic and upper respiratory tract symptoms on the one hand, and the less frequently occurring gastrointestinal, altered state and other symptoms on the other. Similarly, gastrointestinal symptoms are placed into their own cluster in CIS, and in CSS with the exception of loss of appetite. Focusing more on the local structure can make the resulting embeddings more variable between datasets, as the choice of symptoms included in the dataset appears to make more of a difference. We note that the embeddings focusing more on local structure can be more variable between repeats; however, they do highlight small local structures in the data.
The aligned UMAP results in the main paper focus more on local structure; however, the requirement to align several related slices of the datasets appears to make these results more consistent between runs.

Looking at Fig. S13 and Fig. S12, we see a global structure to the relationship between symptoms that synthesises other results. This is clearest in the CIS data, where we can draw a line from gastrointestinal through systemic to respiratory tract symptoms, but with sore throat closer to cough than it is to loss of taste and smell. Such a line could be interpreted as describing a spectrum of COVID-19 symptoms. In the other datasets, this pattern is complicated by other types of symptom, which typically occur closest to gastrointestinal symptoms. The relatively low frequency of these symptoms explains their appearance high in the hierarchical tree, while the higher frequency of systemic and respiratory symptoms explains their relative importance in LPCA components within the general structure revealed by UMAP.

We repeated our main analyses (hierarchical clustering, logistic PCA and AlignedUMAP) on each dataset, stratified by broad age groups: children (0-17 years), adults (18-54 years) and elder adults (55+ years), Supplementary Figures S4-S21. Broadly, we did not find strong differences in the clustering and co-occurrence patterns of symptoms across age groups and studies. The unstratified findings reflect more strongly the middle age category (18-54 years), who account for the majority of the sample in each dataset. It is possible that symptom data collection among young children in particular, which relies upon caregiver reports, could contribute to explaining some of the differences observed. The clear separation of gastrointestinal symptoms and loss of taste and smell is observed across the age strata in the CIS, Supplementary Figure S7, with minor differences in the order in which some other individual symptoms join the tree (e.g.
shortness of breath among children and sore throat amongst elder adults). In Pillar 2 and SGSS datasets, Supplementary Figures S4 and S5 respectively, across age groups the rarer symptoms separate earlier from other symptoms, with some later separation between systemic and upper respiratory symptoms observable. Patterns did not differ greatly across the age strata. Across age strata, symptoms among cases in the CSS, Supplementary Figure S6 , show shortness of breath and delirium (rare symptoms) separating early, followed by some gastrointestinal symptoms (diarrhoea and abdominal pain) and, most clearly among adults 18-54, splitting between systemic and gastrointestinal symptoms and primarily lower and upper respiratory symptoms. For all age-stratified LPCA analyses plotted in Supplementary Figures S8-S10 Similar patterns of separation between upper respiratory, systemic and gastrointestinal symptoms are seen across age groups when examining the UMAP embeddings when hyperparameters were selected that produce well separated clusters, Supplementary Figures S18-S21 . Despite the age-strata being coarser here than in the results of the main paper, Fig 4, we do observe similar structural changes to the data: in the children's age strata, we often observe the formation of several small clusters of symptoms; in the adult's age strata, the embeddings tend to resemble a larger cluster; and in the elder's age strata, the embeddings again start to fragment into two smaller clusters of symptoms. The structural changes are less striking than in the results in Fig. 4 , where finer age slices are used. However, this is expected, given that the coarser age strata used in Supplementary Figures S18-S21 make it harder for UMAP to detect structural changes to patterns of symptom co-occurrence that occur over small changes in age. The results from tuning the UMAP algorithm to focus more on global structure are plotted in Supplementary Figures S14-S17. 
Unlike in embeddings that focus more on the local structure of the dataset, we do not observe strong separation of symptoms into several small clusters in the youngest, or separation into two main clusters in the elderly population. This is to be expected, as focusing more on the global structure results in an embedding that attempts to describe more of the spectrum of the disease, and less on small groups of commonly co-occurring symptoms, providing a complementary analysis. Our interpretation is that, in the youngest and oldest age groups, patterns of co-occurrence of reported symptoms do change, particularly for pairs of symptoms, however we do not observe significant changes to the overall spectrum of the disease which can still be broadly described by number of symptoms experienced, and then the relative contribution of upper respiratory tract symptoms, or gastrointestinal symptoms. Across Pillar 2, SGSS and CIS we consistently observe a central cluster of systemic and lower respiratory tract symptoms. Upper respiratory tract symptoms are clustered close to the systemic symptoms, but further away from the gastrointestinal symptoms. The CSS dataset is the most different, where shortness of breath, fatigue and delirium are clustered close to gastrointestinal symptoms, but further away from the main cluster of systemic, upper respiratory tract and lower respiratory tract symptoms. Figure S4 . Hierarchical clustering of the Pillar 2 dataset with age stratification. Jaccard distance matrices between symptoms adjacent to associated dendrograms obtained through hierarchical clustering under complete linkage. The symptom category is denoted using coloured points at the roots of the dendrogram. The central columns give the name of the symptom with the percentage of symptomatic cases who exhibit symptom in the dataset. a. Children, b. Adults, c. Elders. Figure S5 . Hierarchical clustering of the SGSS dataset with age stratification. 
Jaccard distance matrices between symptoms adjacent to associated dendrograms obtained through hierarchical clustering under complete linkage. The symptom category is denoted using coloured points at the roots of the dendrogram. The central columns give the name of the symptom with the percentage of symptomatic cases who exhibit symptom in the dataset. a. Children, b. Adults, c. Elders. Figure S6 . Hierarchical clustering of the COVID Symptom Study dataset with age stratification. Jaccard distance matrices between symptoms adjacent to associated dendrograms obtained through hierarchical clustering under complete linkage. The symptom category is denoted using coloured points at the roots of the dendrogram. The central columns give the name of the symptom with the percentage of symptomatic cases who exhibit symptom in the dataset. a. Children, b. Adults, c. Elders. Figure S7 . Hierarchical clustering of the COVID-19 Infection Survey dataset with age stratification. Jaccard distance matrices between symptoms adjacent to associated dendrograms obtained through hierarchical clustering under complete linkage. The symptom category is denoted using coloured points at the roots of the dendrogram. The central columns give the name of the symptom with the percentage of symptomatic cases who exhibit symptom in the dataset. a. Children, b. Adults, c. Elders. Figure S12 . UMAP embeddings of SARS-CoV-2 symptoms. The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. Point size is proportional to the proportion of cases that develop a given symptom. For this embedding the parameters were chosen to capture more of the global structure of symptoms and produces less well defined clusters, and was performed without any alignment between datasets. a. Pillar 2., b. SGSS, c. COVID Symptom Study, d. COVID-19 Infection Survey. Figure S13 . UMAP embeddings of SARS-CoV-2 symptoms. 
The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. For this embedding the parameters were chosen to capture more of the local structure of symptoms and produces less well defined clusters, and was performed without any alignment between datasets. a. Pillar 2, b. SGSS, c. COVID Symptom Study, d. COVID-19 Infection Survey. Figure S14 . UMAP embeddings of SARS-CoV-2 symptoms performed on Pillar 2 dataset with age stratification. The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. Point size is proportional to the proportion of cases that develop a given symptom. For this embedding the parameters were chosen to capture more of the global structure of symptoms and produces less well defined clusters. a. Children, b. Adults, c. Elders. Figure S15 . UMAP embeddings of SARS-CoV-2 symptoms performed on SGSS dataset with age stratification. The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. Point size is proportional to the proportion of cases that develop a given symptom. For this embedding the parameters were chosen to capture more of the global structure of symptoms and produces less well defined clusters. a. Children, b. Adults, c. Elders. Figure S16 . UMAP embeddings of SARS-CoV-2 symptoms performed on COVID Symptom Study dataset with age stratification. The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. Point size is proportional to the proportion of cases that develop a given symptom. For this embedding the parameters were chosen to capture more of the global structure of symptoms and produces less well defined clusters. a. Children, b. Adults, c. Elders. Figure S17 . UMAP embeddings of SARS-CoV-2 symptoms performed on COVID-19 Infection Survey dataset with age stratification. 
The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. Point size is proportional to the proportion of cases that develop a given symptom. For this embedding, the parameters were chosen to capture more of the global structure of symptoms, which produces less well-defined clusters. a. Children, b. Adults, c. Elders.

Figure S18. UMAP embeddings of SARS-CoV-2 symptoms performed on the Pillar 2 dataset with age stratification. The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. For this embedding, the parameters were chosen to produce well-separated symptom clusters. a. Children, b. Adults, c. Elders.

Figure S19. UMAP embeddings of SARS-CoV-2 symptoms performed on the SGSS dataset with age stratification. The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. For this embedding, the parameters were chosen to produce well-separated symptom clusters. a. Children, b. Adults, c. Elders.

Figure S20. UMAP embeddings of SARS-CoV-2 symptoms performed on the COVID Symptom Study dataset with age stratification. The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. For this embedding, the parameters were chosen to produce well-separated symptom clusters. a. Children, b. Adults, c. Elders.

Figure S21. UMAP embeddings of SARS-CoV-2 symptoms performed on the COVID-19 Infection Survey dataset with age stratification. The algorithm attempts to place combinations of symptoms that commonly co-occur close to each other. For this embedding, the parameters were chosen to produce well-separated symptom clusters. a. Children, b. Adults, c. Elders.

Funding: CSS funding: ZOE provided in-kind support for all aspects of building, running, and supporting the ZOE app and service to all users worldwide. Support for this study was provided by the National