key: cord-0314720-z6c3mq87 authors: Reese, J.; Blau, H.; Bergquist, T.; Loomba, J. J.; Callahan, T.; Laraway, B.; Antonescu, C.; Casiraghi, E.; Coleman, B.; Gargano, M.; Wilkins, K.; Cappelletti, L.; Fontana, T.; Ammar, N.; Antony, B.; Murali, T. M.; Karlebach, G.; McMurry, J. A.; Williams, A.; Moffitt, R.; Banerjee, J.; Solomonides, A. E.; Davis, H.; Kostka, K.; Valentini, G.; Sahner, D.; Chute, C. G.; Madlock-Brown, C.; Haendel, M. A.; Robinson, P. N. title: Generalizable Long COVID Subtypes: Findings from the NIH N3C and RECOVER Programs date: 2022-05-25 journal: nan DOI: 10.1101/2022.05.24.22275398 sha: 0302c8abbbd3f90c079a8873b2c19b958370c9dd doc_id: 314720 cord_uid: z6c3mq87 Accurate stratification of patients with Post-acute sequelae of SARS-CoV-2 infection (PASC, or long COVID) would allow precision clinical management strategies and could enable more focussed investigation of the molecular pathogenetic mechanisms of this disease. However, the natural history of long COVID is incompletely understood and characterized by an extremely wide range of manifestations that are difficult to analyze computationally. In addition, the generalizability of machine learning classification of COVID-19 clinical outcomes has rarely been tested. We present a method for computationally modeling long COVID phenotype data based on electronic healthcare records (EHRs) and for assessing pairwise phenotypic similarity between patients using semantic similarity. Using unsupervised machine learning (k-means clustering), we found six distinct clusters of long COVID patients, each with distinct profiles of phenotypic abnormalities with enrichments in pulmonary, cardiovascular, neuropsychiatric, and constitutional symptoms such as fatigue and fever. There was a highly significant association of cluster membership with a range of pre-existing conditions and with measures of severity during acute COVID-19. 
We show that the clusters we identified in one hospital system were generalizable across different hospital systems. Semantic phenotypic clustering can provide a foundation for assigning patients to stratified subgroups for natural history or therapy studies on long COVID. Hundreds of millions of cases of acute Coronavirus disease 2019 (COVID -19) have been recorded since the beginning of the pandemic, and more than six million deaths had been reported by the World Health Organization by the end of March, 2022. 1 The clinical presentation of COVID-19 ranges from asymptomatic infection to fatal disease, with many patients continuing to have heterogeneous, long-term, multi-system symptoms including fatigue, post-exertional malaise, dyspnea, cough, chest pain, palpitations, headache, arthralgia, weakness (asthenia), paresthesias, diarrhea, alopecia, rash, impaired balance, and memory or cognitive dysfunction. 2 3 Although there is still no detailed and widely accepted case definition, post-acute sequelae of SARS-CoV-2 infection (PASC, long-haul COVID or long COVID) generally refers to a range of persistent or new symptoms beyond three or four weeks of the initial infection. [4] [5] [6] The NIH REsearching COVID to Enhance Recovery (RECOVER) Initiative program defines PASC as ongoing, relapsing, or new symptoms, or other health effects occurring after the acute phase of SARS-CoV-2 infection (i.e., present four or more weeks after the acute infection). The World Health Organization (WHO) has developed a case definition of "post COVID-19 condition" suggesting that the syndrome is usually diagnosed several months after the onset of acute symptoms of COVID-19 based on new-onset or lingering symptoms (e.g., fatigue, dyspnea, cognitive dysfunction) which cannot be explained by an alternative etiology and which continue for at least two months. 
7 In this work, we will use the term long COVID to refer to patients given a diagnosis using the newly introduced ICD-10 U09.9 code ("Post COVID-19 condition"), acknowledging that it is impossible to know which definition was used for the diagnosis. The pathogenesis of long COVID is incompletely understood, but it appears likely that different pathogenetic mechanisms or combinations thereof may drive disease in individual patients. Potential factors that may contribute to the development of long COVID include aberrant immune responses, persistent viral replication, redox imbalance, formation of fibrinolysis-resistant amyloid fibrin microclots, and consequences from acute SARS-CoV-2 injury to one or multiple organs. [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] At present, there is no specific treatment for long COVID and there is a great necessity to garner a better understanding of long COVID subtypes. Our understanding of the natural history of long COVID is still incomplete. Limited emerging evidence suggests the existence of clinical subtypes or clusters characterized by the predominance of symptoms such as pain, cardiovascular manifestations, or by paucity of symptoms, 18 but computational methods to characterize long COVID subtypes based on a comprehensive phenotypic analysis are lacking, as are approaches to assess the generalizability of clustering approaches across different patient cohorts. In this study, we constructed a cohort of 2464 patients diagnosed with long COVID using the newly introduced ICD-10 U09.9 code ("Post COVID-19 condition") from multicenter electronic health record (EHR) data derived from the National COVID Cohort Collaborative (N3C), a harmonized EHR repository with 2,909,292 COVID-19 positive patients as of March 16, 2022 . 
Previous work mapped 287 unique clinical findings previously reported in studies on long COVID 19 to the Human Phenotype Ontology (HPO), which is widely used to support differential diagnosis and translational research in human genetics. 20, 21 Here, we introduce a semantic approach that identifies patient similarity by transforming EHR data to phenotypic profiles using the HPO, and identify distinct clusters of long COVID patients that displayed highly significant correlations with pre-existing conditions and were generalizable across different hospital systems. As of March 16, 2022 , the N3C platform ("Enclave") contained data for 2,909,292 patients diagnosed with acute COVID-19, and 21 data partners had begun to use the newly introduced ICD-10 diagnosis code U09. 9 for Post COVID-19 condition, providing data for 5,645 patients with this diagnosis (Fig. 1 ). Phenotypic features observed in the post-acute COVID-19 period were mapped from OMOP codes to HPO terms. The post-acute COVID-19 period was defined as starting 21 days after the earliest COVID-19 index date for outpatients, and 21 days after the end of hospitalization for inpatients. The COVID-19 index date for each patient was defined as the earliest date of any positive PCR or antigen SARS-CoV-2 test or COVID-19 U07.1 diagnosis. Patients with long COVID (U09.9 diagnosis) were extracted from the much larger dataset of the N3C. Long COVID patients were selected from the five data partners that provided data for at least 300 U09.9 patients and had an average of at least 7 long COVID HPO terms per patient. The data partner with the most U09.9 patients (data partner 1) was chosen for clustering, and additional U09.9 patients from four other data partners (data partners 2-5) were chosen to assess generalizability. We hypothesized that consistent subgroups of patients with long COVID can be defined based on the spectrum of phenotypic features in the patients' electronic health records (EHR). 
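The index-date and post-acute window rules described above can be sketched in a few lines. This is only an illustration: the function names and the example dates are hypothetical, and the only values taken from the text are the 21-day offset and the choice of anchor date (index date for outpatients, end of hospitalization for inpatients).

```python
from datetime import date, timedelta
from typing import Iterable, Optional

# Offset stated in the text: the post-acute period starts 21 days after the anchor date.
POST_ACUTE_OFFSET = timedelta(days=21)


def covid_index_date(event_dates: Iterable[date]) -> date:
    """Earliest date of any positive PCR/antigen test or U07.1 diagnosis."""
    return min(event_dates)


def post_acute_start(index_date: date, discharge_date: Optional[date] = None) -> date:
    """First day of the post-acute period.

    Outpatients (no discharge date): 21 days after the index date.
    Inpatients: 21 days after the end of hospitalization.
    """
    anchor = discharge_date if discharge_date is not None else index_date
    return anchor + POST_ACUTE_OFFSET


# Hypothetical example dates, for illustration only.
index = covid_index_date([date(2022, 1, 5), date(2022, 1, 1)])
outpatient_start = post_acute_start(index)
inpatient_start = post_acute_start(index, discharge_date=date(2022, 1, 10))
```

Only phenotypic features recorded on or after the returned date would then be eligible for mapping to HPO terms.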
Our previous analysis identified 287 clinical findings previously reported in studies on long COVID and coded these findings using terms of the Human Phenotype Ontology (HPO). 19, 21 Numerous algorithms have been developed that define a fuzzy, specificity-weighted similarity metric between a patient and a computational disease model or between pairs of patients. [22] [23] [24] [25] Here, we adapted an algorithm called Phenomizer that calculates semantic similarity between a pair of patients based on phenotypes (Methods). 26 To leverage this procedure for analysis of N3C data, we mapped the 287 long COVID-associated HPO terms 19 to corresponding Observational Medical Outcomes Partnership (OMOP) codes 27 (see Methods). Of these, 116 terms were identified in the data (Supplemental Tables S1-S11). The terms not found in the data largely were clinical or patient-reported features that are not commonly represented in EHR data, such as Centrilobular ground-glass opacification on pulmonary HRCT (HP:0025180) or Ocular pruritus (HP:0033841), and were not included in further analyses. We selected data partners that provided at least 300 U09.9 patients and an average of at least seven HPO terms per patient (Fig 1) . This threshold was chosen to include data partners with a sufficient number of patients with a sufficient depth of phenotypic information available in EHR data to assess patient similarity. For clustering, we selected U09.9 patients from the data partner (referred to here as data partner 1, as data regulations disallow use of real data partner names or IDs) that supplied data for the greatest number of U09.9 patients (1233 patients). For assessment of the generalizability of the clusters to other data partners, we selected the remaining U09.9 patients from the remaining data partners (referred to here as data partners 2-5, again due to data regulations) (1,231 patients). 
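To make the idea of a specificity-weighted similarity concrete, the following is a minimal sketch of a Phenomizer-style calculation. It is not the authors' implementation: the ontology is a hypothetical seven-term excerpt rather than the real HPO, the cohort is invented, and the symmetric averaging of the two best-match directions is an assumption about how the score is made symmetric.

```python
import math

# Toy is-a hierarchy (child -> parents). Hypothetical excerpt, not the real HPO.
PARENTS = {
    "Bradycardia": ["Arrhythmia"],
    "Tachycardia": ["Arrhythmia"],
    "Arrhythmia": ["Phenotypic abnormality"],
    "Visual hallucinations": ["Hallucinations"],
    "Auditory hallucinations": ["Hallucinations"],
    "Hallucinations": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen


def information_content(patients):
    """IC(t) = -ln(fraction of patients annotated with t or a descendant of t)."""
    counts = {t: 0 for t in PARENTS}
    for terms in patients:
        implied = set().union(*(ancestors(t) for t in terms))
        for t in implied:
            counts[t] += 1
    n = len(patients)
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}


def best_match_similarity(a, b, ic):
    """Average, over terms in a, of the IC of the most informative common ancestor in b."""
    def mica_ic(s, t):
        return max(ic.get(x, 0.0) for x in ancestors(s) & ancestors(t))
    return sum(max(mica_ic(s, t) for t in b) for s in a) / len(a)


def symmetric_similarity(a, b, ic):
    """Mean of the two one-sided best-match scores (an assumed symmetrization)."""
    return 0.5 * (best_match_similarity(a, b, ic) + best_match_similarity(b, a, ic))


# Invented four-patient cohort, for illustration only.
cohort = [
    {"Bradycardia", "Visual hallucinations"},
    {"Tachycardia", "Auditory hallucinations"},
    {"Bradycardia"},
    {"Tachycardia"},
]
ic = information_content(cohort)
sim_01 = symmetric_similarity(cohort[0], cohort[1], ic)
sim_00 = symmetric_similarity(cohort[0], cohort[0], ic)
```

In this toy cohort every patient has some arrhythmia, so Arrhythmia carries no information and the similarity between the first two patients comes entirely from their shared ancestor Hallucinations.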
We calculated the frequency with which each term was used in the total group of 1233 patients from data partner 1 and used this value to determine the information content (a measure of specificity; see Methods) for each term. To calculate pairwise phenotypic similarity of patients at data partner 1 for clustering, we adapted the Phenomizer algorithm (Fig. 2). This resulted in a $1233 \times 1233$ similarity matrix for the 1233 patients at data partner 1. K-means clustering was applied to the data and the number of clusters was determined to be 6 based on visual inspection of the 'elbow' curve (Fig. 3; Supplemental Figure 1).

Figure 2. Calculating patient semantic similarity based on HPO phenotypes. A) HPO terms are arranged in a directed acyclic graph with specific terms such as Bradycardia (HP:0001662) being related to more general terms (here: Arrhythmia; HP:0011675) by subtype relations. An excerpt of the entire ontology (15,247 terms) is shown. B) Example showing a pair of patients with relatively high phenotypic similarity; for each of the HPO terms in patient 1, the best match is sought in patient 2. If an exact match is not found, the algorithm searches for the most informative common ancestor (MICA) in the ontology; the information content (a measure of specificity) of the exact matching term or most specific ancestor term is calculated to determine the specificity. For instance, Visual hallucinations (HP:0002367) and Auditory hallucinations (HP:000876…) are not an exact match, so the information content of their MICA Hallucinations (HP:0000738) is chosen. Hallucinations (HP:0000738) is still relatively specific (and shown in gray), while the MICA of Angina pectoris (HP:0001681) and Hypotension (HP:0002615) is more general (shown in red) and contributes less to the matching score. C) Example of a pair of patients with a relatively lower similarity due to fewer (specific) exact matches and one unmatched term. The pairwise similarity is calculated in this way for all pairs of patients to construct the similarity matrix that is used for clustering (Fig. 3).

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license. This version posted May 25, 2022.

Figure 3. Patient similarity matrix illustrating long COVID subtypes in data partner 1. A heatmap representing the 6 clusters created by k-means clustering is shown. Cluster hierarchy was calculated using the nearest point algorithm and Euclidean distance.

We characterized the features of each of the six clusters with respect to age, gender, and race/ethnicity (Table 1). The six clusters contained between 70 and 301 patients, and differed significantly with respect to rate of hospitalization, age, gender, and ethnicity. Patients in clusters 1 and 6 were overall older, more likely to have been hospitalized during their acute COVID-19 infection, more likely to be male, and less likely to be of White non-Hispanic race/ethnicity. Patients in clusters 3, 4, and 5 were almost entirely non-hospitalized, younger, and more likely to be female. To further characterize each of the six clusters, we identified HPO terms that tended to occur among patients in certain clusters (Fig. 4).
Of the 287 HPO terms we identified as being used in published cohort studies on long COVID, 19 only 116 were identified in our data. The presence or absence of each of the 116 HPO terms used for clustering was treated as a categorical variable whose distribution among the six clusters was assessed using a chi-squared test. Of the 116 HPO terms that were tested, 63 were significantly correlated with cluster membership following Bonferroni correction. Of these, 26 terms had a corrected p-value of less than $10^{-5}$ and were present in at least 20% of patients in one or more clusters and were therefore considered to be the characteristic features that best defined the clustering. HPO terms were classified into these categories: cardiovascular, pulmonary, endocrine, ear nose and throat (ENT), eye, gastrointestinal, immunology, laboratory, neuropsychiatric, skin, and constitutional. The latter category encompasses symptoms and findings such as Fatigue (HP:0012378), Night sweats (HP:0030166), and Xerostomia (HP:0000217) that cannot be unambiguously assigned to a single organ system. UpSet plots 28 were used to visualize the salient characteristics of each cluster according to these categories. These visualizations show not only the most common categories, but also the most common combinations of categories. For instance, in cluster 1, patients most commonly had HPO terms from the categories pulmonary, neuropsychiatric, general, gastrointestinal, cardiovascular, and ear nose throat (ENT), and the single most common category overall was pulmonary. Although there was some overlap in the distribution of features, the profiles of terms and categories were distinct for the six clusters (Fig. 4).
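The per-term test described above (presence or absence of a term across six clusters, followed by Bonferroni correction over 116 tests) can be sketched with the standard library alone. The counts in the example table are hypothetical, the function names are illustrative, and the sketch assumes that every row and column total is positive; it is not the authors' analysis code.

```python
import math


def chi2_statistic(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c count table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat, (len(table) - 1) * (len(col_tot) - 1)


def chi2_sf(x, df):
    """Upper-tail probability of the chi-squared distribution for integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite series in x^j / (1*3*...*(2j+1))
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    tail = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total
    return math.erfc(math.sqrt(x / 2)) + tail


def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical counts: rows = term present / absent, columns = clusters 1-6.
table = [
    [40, 150, 30, 20, 90, 60],
    [200, 100, 170, 50, 60, 40],
]
stat, df = chi2_statistic(table)
p_adjusted = bonferroni(chi2_sf(stat, df), n_tests=116)
```

With six clusters and a binary term indicator the test has five degrees of freedom, and a term would count as significant here only if `p_adjusted` falls below the chosen threshold.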
Shown are the most frequently occurring high-level HPO categories for patients in the overall cohort (A) and for each of the 6 clusters. For the overall population of patients in data partner 1 and for each cluster, the frequency of each category of long COVID HPO terms (left) and the frequency of the three most common combinations of HPO categories are shown. Notably, most clusters contain some widely shared features, but also distinguishing features, such as symptoms in the pulmonary, neuropsychiatric, and cardiovascular systems. Data are shown as UpSet plots, which visualize set intersections in a matrix layout and show the counts of patients with the combinations indicated by the black dots as bars above the matrix. 28 The most commonly occurring HPO category in each cluster is highlighted. HPO term combinations that occur less than 20 times are masked to limit the risk of patient re-identification.

Marked differences among groups were seen in the frequency with which certain symptoms were observed. For example, Nasal congestion (HP:0001742) was frequent (~31%) in cluster 4, and Cough (HP:0012735) was especially common (>60% of patients) in clusters 2 and 6 compared with the other clusters, although appreciable rates of Cough (HP:0012735) were seen among all clusters. Cardiac or potential cardiac signs and symptoms, such as Palpitations (HP:0001962), Tachycardia (HP:0001649), or Chest pain (HP:0100749), were relatively common in clusters 5 and 6 compared with the other clusters (for example, half or more of patients in cluster 5 had each of these symptoms), although chest pain was also seen in ~31% of cluster 2 patients. Hypotension (HP:0002615) was most common in cluster 6.
Pain (HP:0012531) and Fatigue (HP:0012378) were relatively frequent in clusters 2, 3, and, particularly, clusters 5 and 6 (rates for these symptoms ranged from ~56-79% in the latter two clusters). Cluster 6 was also notable for a high frequency of other constitutional symptoms, including Fever (HP:0001945), Asthenia (HP:0025406), and Myalgia (HP:0003326), as well as a number of gastrointestinal symptoms, such as Abdominal pain (HP:0002027), Diarrhea (HP:0002014), and Nausea (HP:0002018). Vertigo (HP:0002321) was common in cluster 5 (~34%) and cluster 6 (~25%). Depression (HP:0000716) and Headache (HP:0002315) were more common in clusters 3 and 6 versus other cohorts, and Insomnia (HP:0100785) was most frequent in cluster 6 ( Fig. 5 ). Both advanced age and female sex have been associated with an increased risk of developing long COVID. 29 Interestingly, the average age in clusters 1 and 6 was higher than that in the other clusters, but the proportion of women in these clusters was lower than in three of the other four clusters. Both clusters 1 and 6 showed a high frequency of post-acute COVID-19 laboratory abnormalities that have been associated with severe course of acute COVID-19, namely, Lymphopenia (HP:0001888), Elevated circulating alanine aminotransferase concentration (HP:0031964), Increased circulating ferritin concentration (HP:0003281), Elevated circulating alkaline phosphatase concentration (HP:0003155), Hypocalcemia (HP:0002901), and Thrombocytopenia (HP:0001873). [30] [31] [32] [33] [34] [35] This, and the fact that the average age was higher and the overall frequency of annotations with HPO terms was higher in these clusters ( Supplemental Fig 1) , suggests that clusters 1 and 6 may represent patients with residual manifestations of more severe COVID-19 and/or long COVID manifestations, although severity cannot unambiguously be inferred from EHR data. . 
To investigate how clinical features before or during COVID-19 infection correlated with cluster membership, we assessed the distribution across the six clusters of 44 clinical features determined prior to acute COVID-19 or during acute COVID-19. Of these, 19 displayed a statistically significant difference between clusters and are shown in Tables 2 and 3. Among parameters that were present before acute COVID-19 (Table 2), 13 differed significantly between clusters. Chronic lung disease, peripheral vascular disease, kidney disease, diabetes, coronary artery disease, heart failure, and acute kidney injury (AKI) were all more frequent in clusters 1 and 6 (Table 2). The risk of long COVID has been shown to be associated with the number of comorbidities. 36 Additionally, obesity, which has been shown to be a risk factor for long COVID, 37 was also more common in clusters 1 and 6. These observations are consistent with the notion that clusters 1 and 6 are composed of patients with more severe clinical manifestations, and that there may be different risk factors for clusters 2-5. Covariates during acute COVID-19 whose frequencies were higher in clusters 1 and 6 included acute kidney injury (AKI) and medications such as corticosteroids, remdesivir, and vasopressors that may be proxies for a severe clinical course (Table 3). Severity of acute COVID has been associated with risk of persistent symptoms in some studies. 38

… Table S13) that were significantly overrepresented in clusters (chi-squared p < 0.001 after Bonferroni correction) and the percent of patients in each cluster with each clinical feature are shown.

The results presented in the previous sections were generated with data from data partner 1. We assessed the generalizability of the clustering results for four additional data partners (data partners 2-5, Fig. 1) by comparing each patient in these data partners with the patients in each cluster from data partner 1 and also to randomly permuted clusters (Methods). If the clusters in data partner 1 did not generalize at all to other data partners, we would expect that patients from other data partners would be equally similar to the patients of any of the clusters in data partner 1. We observed that patients from data partners 2-5 were much more similar to clusters from data partner 1 than to randomly permuted clusters. The mean similarity ranged from 0.179 to 0.182 for test data partners 2-5 for the randomly permuted clusters, but the observed mean similarities to the original clusters at data partner 1 ranged from 0.270 to 0.300, corresponding to z-scores of 150 to 266. The mean similarity score for the randomly permuted clusters was never as high as the observed score over 1000 permutations, corresponding to an empirical p-value of less than 0.001 for each of the data partners 2-5. This strongly suggests that clusters identified in data partner 1 generalize to patients from other data partners (Table 4).

[Table 4 column headers: Observed mean similarity; Z-score; Empirical p-value]

Table 4. Generalizability of clusters in patients from new data partners.
The similarity of patients from test data partners 2-5 to patients from clusters generated from data partner 1, and to patients from randomly permuted clusters, was measured as in Fig. 2. For patients from the given data partner, the average similarity of patients to the best matching randomly permuted cluster and to the best matching cluster from data partner 1, as well as the Z-score and p-value for each test data partner, are shown. The empirical p-value reflects the number of times that the similarity of a permuted dataset was higher than that of the observed clusters (this never occurred).

According to the World Health Organization, approximately 10-20% of patients with COVID-19 may experience new-onset, lingering or recurrent clinical symptoms after acute infection. This has been termed 'post-acute sequelae of SARS-CoV-2 infection' (PASC) or long COVID. Definitions of long COVID in the literature vary, and the frequencies and time course of phenotypic manifestations following acute COVID-19 are highly heterogeneous. 19 This observation raises the question of whether long COVID can be stratified into well delineated and reproducible subtypes, or whether the degree of heterogeneity is so high that stratification is impossible. This is critically relevant for defining sub-cohorts for clinical research studies such as the NIH program "Researching COVID to Enhance Recovery (RECOVER)," as well as for identifying candidate therapeutics. ML clustering methods offer a data-driven approach to stratification of patients to reveal such subtypes in the face of this heterogeneous, new disease.
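The permutation-based generalizability check summarized in Table 4 can be sketched as follows. The details are assumptions rather than the authors' procedure: the similarity of a test patient to a cluster is taken to be the mean similarity to its members, the null distribution is generated by shuffling cluster labels, and the empirical p-value uses the common add-one convention. The similarity values and all names are invented for illustration.

```python
import random
import statistics


def mean_best_cluster_similarity(sim, labels):
    """Average, over test patients, of the similarity to their best matching cluster.

    sim[i][j] is the similarity of test patient i to reference patient j;
    labels[j] is the cluster of reference patient j.
    """
    clusters = sorted(set(labels))
    best_scores = []
    for row in sim:
        per_cluster = [
            statistics.fmean([s for s, lab in zip(row, labels) if lab == c])
            for c in clusters
        ]
        best_scores.append(max(per_cluster))
    return statistics.fmean(best_scores)


def permutation_test(sim, labels, n_perm=1000, seed=0):
    """Compare the observed score with scores obtained from shuffled cluster labels."""
    rng = random.Random(seed)
    observed = mean_best_cluster_similarity(sim, labels)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(mean_best_cluster_similarity(sim, shuffled))
    z = (observed - statistics.fmean(null)) / statistics.stdev(null)
    # Add-one estimator so that the empirical p-value is never exactly zero.
    p = (1 + sum(v >= observed for v in null)) / (n_perm + 1)
    return observed, z, p


# Invented data: 12 reference patients in 3 clusters, 3 test patients.
labels = [0] * 4 + [1] * 4 + [2] * 4
sim = [
    [0.9] * 4 + [0.1] * 8,
    [0.1] * 4 + [0.9] * 4 + [0.1] * 4,
    [0.1] * 8 + [0.9] * 4,
]
observed, z, p = permutation_test(sim, labels, n_perm=200)
```

When no permuted score reaches the observed one, the add-one estimator with 1000 permutations gives 1/1001, in line with the "less than 0.001" reported in the text.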
Evidence available prior to our study suggests that important clinical differences do exist that influence the susceptibility to subsequent complications of COVID-19. For instance, although males are more likely to be hospitalized or die with acute COVID-19, females are more likely to develop long COVID. 39 It is possible that the pathophysiology of long COVID may be multifactorial in origin. Conceivably, the biological underpinnings of long COVID may vary among individuals as a function of baseline risk factors, resulting in different general phenotypes of long COVID, the treatment or prevention of which may need to be specifically tailored using precision medicine in order to achieve optimal outcomes. As a first step, we sought to use unsupervised learning to delineate potential subtypes of patients with long COVID with differing clinical characteristics. We identified six published studies that present clusters from either patient-reported data (in four studies) or manually recorded clinical data (two studies) with cohorts of between 145 and 3762 patients. The studies report two or three clusters based on different types of input data, making study comparison challenging. None of the studies were based on EHR data and no assessment of generalizability to other data partners was presented. 18, 29, [40] [41] [42] [43] Here we have presented a novel method for semantic clustering of long COVID patients based on HPOencoded EHR data. We further present a method for assessing generalizability of the identified subtypes or clusters across different data contributing sites. Ontology-based algorithms differ from machine learning and other algorithms in many ways. Coding numerical data with HPO implies that parameters are simplified into categories. 
Although this loss of numerical data reduces precision in data granularity, simplification allows powerful simultaneous analysis of all phenotypic observations using semantic similarity that can take the relatedness of concepts into account. Our method for assessing patient-patient similarity using the Phenomizer algorithm generates an essentially continuous similarity value from arbitrary sets of HPO terms that characterize any two patients. An alternative method would be to encode the 287 HPO terms as a 287-dimensional feature vector and to measure similarity for example using dot product (cosine) of these vectors. The Phenomizer algorithm has several advantages over the feature vector method: it does not suffer from sparse count issues that may make clustering less robust, 46 and it takes advantage of the similarity between individual items using the structure of the HPO in a way that a feature vector cannot. 26 This approach has proven powerful both in the support of differential diagnosis of rare disease and in efforts to enable longitudinal analysis of EHR data as a means of identifying gene-phenotype associations with Mendelian forms of epilepsy, 44 ,45 but has never before been applied in the context of infectious disease EHR data and methods for assessing generalizability have not previously been presented. We have shown that unsupervised learning based on semantic clustering identifies phenotypic profiles that are reproducible across data partners with a high degree of statistical significance. The six clusters that emerged demonstrated non-uniform frequencies of symptoms and clinical findings across an array of features, spanning constitutional/systemic symptoms and pain, cardiac, respiratory, gastrointestinal, and neurologic symptom domains, with some degree of overlap but clear distinctions between various groups. 
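The limitation of the feature-vector alternative mentioned above can be shown with a small example. The three-term vocabulary is a hypothetical stand-in for the 287-dimensional encoding, and the helper names are illustrative.

```python
import math

# Hypothetical three-term vocabulary standing in for the 287 HPO terms.
TERMS = ["Visual hallucinations", "Auditory hallucinations", "Bradycardia"]


def encode(patient_terms):
    """Binary feature vector: 1.0 if the patient has the term, else 0.0."""
    return [1.0 if t in patient_terms else 0.0 for t in TERMS]


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


patient_a = encode({"Visual hallucinations"})
patient_b = encode({"Auditory hallucinations"})
related_but_distinct = cosine(patient_a, patient_b)
identical = cosine(patient_a, patient_a)
```

The two patients have closely related findings, yet their vectors are orthogonal and the cosine similarity is zero; an ontology-aware measure would instead credit their shared parent term Hallucinations.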
We interpret our clusters 1 and 6 as comprising patients with a severe course of acute COVID-19 because of the higher hospitalization rates (Table 1) and the higher rates of mechanical ventilation and use of medication such as vasopressors that indicate a relatively severe course (Table 3). Interestingly, cluster 1 was male-predominant (59.1%) and cluster 6 was female-predominant (58.0%). The higher rates of most pre-existing comorbidities in patients from clusters 1 and 6 are in accordance with the notion of more severe clinical courses. Our results show that these subgroups tended to be affected by a wider range of clinical complications in the post-acute course; for instance, the most common profile of HPO terms involved six of nine clinical categories in cluster 1 and seven of nine in cluster 6 (Fig. 4). Our findings confirm and extend previous findings of a steeper risk gradient for long COVID manifestations that increases according to the severity of the acute COVID-19 infection. 47 The relatively high rate of pre-COVID corticosteroid use in our study (with the lowest rate being 37.7% in cluster 4 and the two highest rates 61.3% in cluster 1 and 71.6% in cluster 6) is striking. Dexamethasone use was associated with lower 28-day mortality among those who were receiving either invasive mechanical ventilation or oxygen but not among those receiving no respiratory support.
48 However, methylprednisolone use may be associated with increased mortality and more severe neuromuscular weakness in some patients with acute respiratory distress syndrome (ARDS), 49 and there are reasons to believe that protracted corticosteroid therapy could contribute to the development of some long COVID manifestations such as fatigue, myopathy, neuromuscular weakness, and psychiatric symptoms. 50 However, future work will be needed to determine what causal role, if any, steroid use has in the development of long COVID. A substantial body of evidence documents a sex difference in the severity of acute COVID-19, with a more favorable course of the disease in women compared to men regardless of age. 51 Emerging evidence suggests that the clinical manifestations of long COVID may also be characterized by sex differences. [52] [53] [54] Our results show a cluster with predominantly hospitalized and male patients (cluster 1) and other clusters with predominantly non-hospitalized and female patients (clusters 3 and 4), which suggests that males and females may differ with respect to long COVID manifestations. A focused, prospective study could help to clarify the extent of potential sex differences in long COVID. We suggest that analogous algorithms could be used to evaluate data gathered from prospective studies of long COVID patients to extend and deepen our characterization of phenotypic clusters by including data that are currently difficult to ascertain reliably from EHR data, including symptoms such as Asthenia (HP:0025406) or Exertional dyspnea (HP:0002875) and radiology findings (which are typically not represented using structured fields in EHR data and are underrepresented in OMOP datasets).
The recently released Phenopacket Schema of the Global Alliance for Genomics and Health (GA4GH) provides a standardized way to record clinical findings including phenotypic features, measurements, biospecimens, and medical actions over the time course of a disease as a computational case report. 55 Recording clinical data with the Phenopacket Schema would promote data sharing and comparability of results from different studies. While our study provides insight into the variability and natural history of long COVID, there are limitations that should be considered. While the U09.9 code provides a simple inclusion criterion, its application in health systems across the country is not uniform and may differ from one data partner to another. Also, since the use of the code began only recently, patients with long COVID who were diagnosed prior to the introduction of the code are not included, limiting our ability to compare the current clinical manifestations with those observed earlier in the pandemic before widespread vaccination and with different distributions of SARS-CoV-2 strains and variants. However, in a pilot study in Denmark, coding with U09.9 was found to have a positive predictive value of 94% for long COVID. 56 Our ability to capture clinical manifestations of long COVID is limited by the accessibility of clinical data in EHR systems. Of the 287 HPO terms we identified as being used in published cohort studies on long COVID, 19 only 116 were identified in our data. The reasons for this presumably include unstructured data such as symptoms and radiological findings that are not well represented in the OMOP data that is the source of our data.
Examples include Gaze-evoked nystagmus (HP:0000640), Pericardial effusion (HP:0001698), and Exercise intolerance (HP:0003546) that are typically diagnosed using specialist examinations or medical history that may not be easily coded in structured EHR fields. Additionally, several common manifestations of long COVID, including dysautonomia, 57 are less documented in EHR data in part due to the difficulties in recognizing these illnesses clinically and the fact that relevant findings may not be well represented in structured fields including the OMOP data available in N3C. Our study uses the newly minted ICD code U09.9 to identify patients with PASC/long COVID. At the time of this writing, a relatively small number of affected patients was available for analysis. Furthermore, the population defined by these patients is not fully representative of the American population; for instance, the proportion of African Americans in our study (~5%) is lower than the proportion of African Americans among the entire population. As more data accrues, future work will be required to characterize the role of social determinants of health that are confounded with race in our society in determining long COVID subtypes. It is likely that many additional long COVID patients are present in the N3C dataset who have not received the U09.9 diagnosis code, and it is possible that this fact could introduce a bias into the data analyzed in this study. Additionally, the group of patients who present for medical care for long COVID symptoms and receive a U09.9 diagnostic code may not be representative of the entire population of patients with long COVID manifestations. Our exploration of k-means clustering results with different values of k from 2 to 8 showed that increasing the number of clusters tended to subdivide existing clusters hierarchically. 
Although numerous methods for determining the 'best' number of clusters are available, there is no objective definition of optimum that applies to all applications, and the choice of k is necessarily subjective. Our main findings of generalizable phenotypic clusters also hold for k = 4 and k = 5 (Supplemental Fig. S2-S3).
We have presented a novel algorithm for semantic clustering that identifies patient similarity by transforming EHR data to phenotypic profiles using the HPO, and identified long COVID subtypes that show a statistically significant degree of generalizability of clusters across different medical centers. The clusters expand our knowledge of clinical profiles of long COVID. Semantic phenotypic clustering could provide a basis for assigning patients to stratified subgroups for natural history or therapy studies.
The N3C data transfer to NCATS is performed under a Johns Hopkins University Reliance Protocol #IRB00249128 or individual site agreements with NIH. The N3C Data Enclave is managed under the authority of the NIH; information can be found at https://ncats.nih.gov/n3c/resources. We obtained patient data from the National COVID Cohort Collaborative (N3C; covid.cd2h.org). N3C aggregates and harmonizes EHR data across multiple clinical organizations in the United States, including the Clinical and Translational Science Awards (CTSA) Program hubs. N3C harmonizes EHR data across four clinical data models and provides a unified analytical platform in which data are encoded using the Observational Medical Outcomes Partnership (OMOP) 27 version 5.3.1.
The Centers for Disease Control and Prevention (CDC) announced an International Classification of Diseases, version 10 (ICD-10) code (U09.9) for emergency/provisional use on June 30, 2021. The code represents Post COVID-19 condition, unspecified. Use of the code was approved for implementation effective October 1, 2021. The code should be used for patients with a history of probable or confirmed SARS-CoV-2 infection who are identified with a post-COVID condition. The data freeze date was March 16, 2022. Only patients with an initial COVID-19 diagnosis within the Enclave were included in the cohort. At the time of the data freeze for this analysis, 21 participating data partners were using the code, and a total of 5645 patients were coded in this way. The HPO is a rich representation of the diversity of phenotypic features associated with human disease and is the de facto standard for the computational analysis and exchange of phenotype data in human genetics. 20, [58] [59] [60] [61] [62] The HPO comprises over 16,000 terms that denote phenotypic abnormalities at increasingly specific levels of granularity, for example, Atrial septal defect (HP:0001631) and Interrupted inferior vena cava with azygous continuation (HP:0011671). We recently identified 287 unique clinical findings reported in cohorts of patients with long COVID and mapped them to existing HPO terms, in some cases creating new HPO terms to cover COVID-specific features such as Pseudo-chilblains on toes (HP:0034036). 19 The 2020-08-11 release of the HPO was used in our study. To obtain mappings between standard OMOP condition concept identifiers and HPO concepts, we used OMOP2OBO (https://github.com/callahantiff/OMOP2OBO) 63 and LOINC2HPO. 64 The OMOP2OBO algorithm creates and validates mappings between OMOP terminology concepts and concepts from the Open Biomedical Ontologies, 65 using a variety of alignment strategies and with varying levels of confidence.
For this project, we filtered the v1.0.0 release of mappings to only include exact 1:1 mappings at the concept level. This mapping set aligned 4,767 OMOP concept IDs to 3,804 unique HPO concepts (1.25 OMOP concept IDs/HPO concept). To apply LOINC2HPO mappings from OMOP to HPO concepts, we reimplemented the LOINC to HPO mappings in the N3C Enclave. For any HPO term that was among the 287 HPO terms associated with long COVID, we determined for each patient in our study group the LOINC codes present in the measurement OMOP table determined to be 'low', 'high', or 'positive' compared to the reference range for the test in question, and assigned the HPO term to the patient if the test occurred during the long COVID period for that patient (starting 21 days after diagnosis of acute COVID-19 for outpatients, and 21 days after hospitalization for inpatients).
We previously developed a method called Phenomizer for clinical diagnostics that uses the semantic structure of the HPO to weight clinical features on the basis of specificity and to identify those clinical features that best distinguish among the top candidate differential diagnoses. 26 The algorithm represents the clinical specificity of a finding as the information content (IC) of a term. Given a set of diseases of interest in the differential diagnosis process, the frequency of each HPO term is defined as the proportion of diseases in a database that are annotated by the term or any of its descendant terms (for instance, the HPO resource currently comprises 8,260 Mendelian diseases). 21 The IC is then defined as the negative natural logarithm of the term frequency. 66 The true path rule applies to all terms in the HPO. That is, if a disease is annotated to the term t, it is implicitly annotated to all ancestors of t recursively (for instance, Marfan syndrome is annotated to Aortic root aneurysm (HP:0002616), and it is therefore implicitly annotated to the parent term Thoracic aortic aneurysm (HP:0012727) and its parent term Aortic aneurysm (HP:0004942), and so on). Thus, the IC of terms increases as we move from the root term of the HPO ontology to the more specific descendant terms. In essence, this procedure leverages the ontological structure of the HPO to perform specificity-weighted fuzzy matching. In the Phenomizer algorithm, a set of query terms (symptoms, signs, etc.) entered by a physician for an individual case is used to calculate a similarity score for each of the diseases in the HPO database as an aid in differential diagnosis. In the current work, we adapt this algorithm to implement semantic phenotype-based clustering by using the Phenomizer framework to calculate a matrix of pairwise phenotypic similarities between all patients in the long COVID cohort.
In the following, we represent the set of [...] i and c is the number of clusters (a user-chosen hyperparameter). Using a previously described method, c cluster centroids were chosen such that centroids were distant from one another. 67 Clusters were then formed iteratively such that the Euclidean distance between the vector that represents any object and the centroid vector of its cluster was at least as small as that between the object and any of the other clusters. In each iteration, objects were moved to the cluster with the closest centroid, following which the centroids were recalculated until no further improvement was obtained or the maximum number of iterations (100) was reached. 68 The k-means clustering method does not determine the 'optimal' number of clusters. We used the elbow method to choose the number of clusters.
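As an illustration of the clustering step, the following minimal Python sketch implements k-means with k-means++ seeding and reports the within-cluster sum of squared errors (SSE) for a range of cluster counts, which is the quantity inspected by the elbow method. It is not the implementation used in this study: the function names (`sq_dist`, `kmeanspp_init`, `kmeans`), the toy two-dimensional points, and the fixed random seed are assumptions made for the example, whereas the iteration cap of 100 follows the text.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_init(points, k, rng):
    """k-means++ seeding: each further centroid is drawn with probability
    proportional to its squared distance from the nearest centroid so far.
    Assumes k does not exceed the number of distinct points."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's iterations; returns (labels, centroids, sse)."""
    rng = random.Random(seed)
    centroids = kmeanspp_init(points, k, rng)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its closest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # no object changed cluster: converged
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, sse

# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# SSE for each candidate number of clusters; the elbow is read off this curve.
sse_by_k = {k: kmeans(points, k)[2] for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 3))
```

Plotting `sse_by_k` against the number of clusters and looking for the point at which the decrease flattens corresponds to the elbow heuristic; on real patient vectors the elbow may be far less pronounced than on this toy example.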
The elbow method computes the total within-cluster sum of squared errors (SSE) for each candidate number of clusters. The SSE is plotted against the number of clusters and an 'elbow' in the curve is used to determine the number of clusters.
We first performed clustering on patients from the data partner with the greatest number of U09.9 long COVID patients. For brevity, we will refer to this as data partner 1. We then assessed reproducibility of clustering results in data partners 2-5 as explained below. This approach was chosen given the inherent challenge (noted in the literature [69] [70] [71]) that we lack a generally applicable method for assessing any given clustering approach. For brevity, we will refer to data partners 2-5 as the test data partners. The HPO terms for patients from data partner 1 and their assignment to k-means clusters were recorded. We reasoned that if the clustering results in data partner 1 are generalizable, then patients of the test data partners will tend to display more similarity to one or other cluster of data partner 1 than one would expect by chance. Assuming [...]. To generate a null distribution of this statistic, we create 1,000 permuted cluster assignments by assigning each patient from data partner 1 uniformly at random to one of the k clusters. We compute the test statistic for each of these random cluster assignments and record the mean and standard deviation of these values. We present the results as a z score calculated as $z = (s - \mu)/\sigma$, where $s$ is the observed test statistic and $\mu$ and $\sigma$ are the mean and standard deviation of the statistic over the permuted cluster assignments. The HPO terms assessed in the above procedures were derived from clinical data at least 21 days after the initial bout of COVID-19.
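The permutation procedure described above can be sketched in Python as follows. The test statistic used here is a hypothetical stand-in, not necessarily the one used in this study: for each test patient it takes the highest mean similarity to the members of any reference cluster and then averages over test patients. The function names, the toy similarity values, and the random seed are likewise assumptions; the 1,000 uniformly random cluster assignments and the z score follow the text.

```python
import random
import statistics

def generalizability_stat(sim_to_ref, ref_labels, k):
    """Hypothetical statistic: mean over test patients of the highest
    cluster-averaged similarity to the reference (data partner 1) patients."""
    per_patient = []
    for sims in sim_to_ref:
        cluster_means = []
        for c in range(k):
            members = [s for s, lab in zip(sims, ref_labels) if lab == c]
            if members:  # a random assignment may leave a cluster empty
                cluster_means.append(sum(members) / len(members))
        per_patient.append(max(cluster_means))
    return sum(per_patient) / len(per_patient)

def permutation_z(sim_to_ref, ref_labels, k, n_perm=1000, seed=0):
    """z score of the observed statistic against randomly assigned clusters."""
    rng = random.Random(seed)
    observed = generalizability_stat(sim_to_ref, ref_labels, k)
    null = []
    for _ in range(n_perm):
        # Assign each reference patient uniformly at random to one of k clusters.
        random_labels = [rng.randrange(k) for _ in ref_labels]
        null.append(generalizability_stat(sim_to_ref, random_labels, k))
    mu = statistics.mean(null)
    sigma = statistics.stdev(null)
    return (observed - mu) / sigma

# Toy data (hypothetical): six reference patients in two clusters and four
# test patients; each row holds one test patient's similarity to each
# reference patient.
ref_labels = [0, 0, 0, 1, 1, 1]
test_sims = [
    [0.9, 0.9, 0.9, 0.1, 0.2, 0.1],
    [0.9, 0.9, 0.9, 0.2, 0.1, 0.1],
    [0.1, 0.2, 0.1, 0.9, 0.9, 0.9],
    [0.2, 0.1, 0.1, 0.9, 0.9, 0.9],
]

observed = generalizability_stat(test_sims, ref_labels, k=2)
z = permutation_z(test_sims, ref_labels, k=2)
print(round(observed, 3), z > 0)
```

A clearly positive z score indicates that test patients resemble the reference clusters more closely than they resemble randomly formed groups of the same reference patients, which is the sense of generalizability described above.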
We analyzed additional clinical covariates covering items such as comorbidities and medications prior to and during acute COVID-19 (Supplemental Tables S2-S3). Categorical variables were assessed with a chi-squared test if at least five counts were present for each cell of the contingency table and numerical variables were assessed with one-way ANOVA. Analysis was done using R version 3.5.1.
Weekly operational update on COVID-19
Long COVID: An overview
Incidence, co-occurrence, and evolution of long-COVID features: A 6-month retrospective cohort study of 273,618 survivors of COVID-19
Characterising long COVID: a living systematic review
Post-acute COVID-19 syndrome
Long covid-mechanisms, risk factors, and management
A clinical case definition of post-COVID-19 condition by a Delphi consensus
Long COVID or Post-acute Sequelae of COVID-19 (PASC): An Overview of Biological Factors That May Contribute to Persistent Symptoms
Markers of Immune Activation and Inflammation in Individuals With Postacute Sequelae of Severe Acute Respiratory Syndrome Coronavirus 2 Infection
Immuno-proteomic profiling reveals aberrant immune cell regulation in the airways of individuals with ongoing post-COVID-19 respiratory disease
Immune-Based Prediction of COVID-19 Severity and Chronicity Decoded Using Machine Learning
Multiple early factors anticipate post-acute COVID-19 sequelae
COVID-19 in an immunocompromised host: persistent shedding of viable SARS-CoV-2 and emergence of multiple mutations: a case report
SARS-CoV-2 infection and persistence throughout the human body and brain
A central role for amyloid fibrin microclots in long COVID/PASC: origins and therapeutic implications
Insights from myalgic encephalomyelitis/chronic fatigue syndrome may help unravel the pathogenesis of postacute COVID-19 syndrome
Redox imbalance links COVID-19 and myalgic encephalomyelitis/chronic fatigue syndrome
Identification of Distinct Long COVID Clinical Phenotypes Through Cluster Analysis of Self-Reported Symptoms
Characterizing Long COVID: Deep Phenotype of a Complex Condition
The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease
The Human Phenotype Ontology in 2021
Computational analysis of 10,860 phenotypic annotations in individuals with SCN2A-related disorders
Improved exome prioritization of disease genes through cross-species phenotype comparison
Interpretable Clinical Genomics with a Likelihood Ratio Paradigm
Phenolyzer: phenotype-based prioritization of candidate genes for human diseases
Clinical diagnostics in human genetics with semantic similarity searches in ontologies
Feasibility and utility of applications of the common data model to multiple, disparate observational health databases
UpSet: Visualization of Intersecting Sets
Attributes and predictors of long COVID
Long COVID or post-COVID-19 syndrome: putative pathophysiology, risk factors, and treatments
Elevations in Liver Transaminases in COVID-19: (How) Are They Related?
Hyperferritinemia in critically ill COVID-19 patients - Is ferritin the product of inflammation or a pathogenic mediator?
A High Percentage of Patients Recovered From COVID-19 but Discharged With Abnormal Liver Function Tests
Hypocalcemia is associated with severe COVID-19: A systematic review and meta-analysis
Altered platelet and coagulation function in moderate-to-severe COVID-19
Persistent symptoms 1.5-6 months after COVID-19 in non-hospitalised subjects: a population-based cohort study
Obesity and lipid metabolism disorders determine the risk for development of long COVID syndrome: a cross-sectional study from 50,402 COVID-19 patients
Post-COVID syndrome: A single-center questionnaire study on 1007 participants recovered from COVID-19
The four most urgent questions about long COVID
Investigating phenotypes of pulmonary COVID-19 recovery: A longitudinal observational prospective multicenter trial
Clustering analysis reveals different profiles associating long-term post-COVID symptoms, COVID-19 symptoms at hospital admission and previous medical co-morbidities in previously hospitalized COVID-19 survivors
Characteristics and impact of Long Covid: Findings from an online survey
Characterizing long COVID in an international cohort: 7 months of symptoms and their impact
A longitudinal footprint of genetic epilepsies using automated electronic medical record interpretation
000 Genomes Project Pilot Investigators et al. 100,000 Genomes Pilot on Rare-Disease Diagnosis in Health Care - Preliminary Report
Assessing the robustness of cluster solutions obtained from sparse count matrices
High-dimensional characterization of post-acute sequelae of COVID-19
Dexamethasone in Hospitalized Patients with Covid-19
Corticosteroids for COVID-19: the search for an optimum duration of therapy
Sex Disparities in COVID-19 Severity and Outcome: Are Men Weaker or Women Stronger?
Neurological manifestations of long-COVID syndrome: a narrative review
Persistence of neuropsychiatric symptoms associated with SARS-CoV-2 positivity among a cohort of children and adolescents
Female Sex Is a Risk Factor Associated with Long-Term Post-COVID Related-Symptoms but Not with COVID-19 Symptoms: The LONG-COVID-EXP-CM Multicenter Study
The GA4GH Phenopacket schema: A computable representation of clinical data for precision medicine
Positive Predictive Value of the ICD-10 Diagnosis Code for Long-COVID
Characterization of Autonomic Symptom Burden in Long COVID: A Global Survey of 2,314 Adults
Phenotype ontologies and cross-species analysis for translational research
The Human Phenotype Ontology in 2017
The human phenotype ontology
The Human Phenotype Ontology: Semantic Unification of Common and Rare Disease
The Human Phenotype Ontology project: linking molecular biology and disease through phenotype data
OMOP2OBO
Semantic integration of clinical laboratory tests from electronic health records for deep phenotyping and biomarker discovery
OBO Foundry in 2021: operationalizing open data principles to evaluate ontologies
Semantic similarity in biomedical ontologies
k-means++: the advantages of careful seeding
K-means clustering: a half-century synthesis
Dissolution point and isolation robustness: Robustness criteria for general cluster analysis methods
Robustness Properties of k Means and Trimmed k Means
Evaluation and selection of clustering methods using a hybrid group MCDM
The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment
This study is part of the NIH Researching COVID to Enhance Recovery (RECOVER) Initiative (https://recovercovid.org/), which seeks to understand, treat, and prevent the post-acute sequelae of SARS-CoV-2 infection (PASC).