key: cord-1029362-94pgth75 authors: Ruiz-Sternberg, Ángela María; Chaparro-Solano, Henry Mauricio; Albornoz, Ludwig Luis Antonio; Pinzón-Rondón, Ángela María; Pardo-Oviedo, Juan Mauricio; Molano-González, Nicolás; Otero-Rodríguez, Diego Andrés; Zapata-Gómez, Fabio Andrés; Gálvez-Bermúdez, Jubby Marcela title: GENOMIC CHARACTERIZATION OF SARS-CoV-2 AND ITS ASSOCIATION WITH CLINICAL OUTCOMES: A ONE-YEAR LONGITUDINAL STUDY OF THE PANDEMIC IN COLOMBIA date: 2021-12-15 journal: Int J Infect Dis DOI: 10.1016/j.ijid.2021.12.326 sha: 32710384fb0cafff2f02619d5d47ef865a4ade2a doc_id: 1029362 cord_uid: 94pgth75 Objectives This study aims to explore the association between the molecular characterization of SARS-CoV-2 and disease severity on ambulatory and hospitalized patients in two main Colombian epicenters during the first year of the COVID-19 pandemic. Methods We included 1000 patients with SARS-CoV-2 infection, collected clinical data from 997, and obtained 678 whole genome sequences by massively parallel sequencing. Bivariate, multivariate, and classification and regression tree analyses were run between clinical and genomic variables. Results Age and infection with lineages B.1.1, B.1.1.388, B.1.523, and B.1.621 were related to lethality for patients 71-88 years old (OR: 6.048036; 95% CI 1.346567-32.92521, p-value: 0.01718674). The need for hospitalization was associated with higher age and comorbidities. For patients 38-51 years old infected with lineages A, B, B.1.1.388, B.1.1.434, B.1.153, B.1.36.10, B.1.411, B.1.471, B.1.558 or B.1.621, hospitalization rate increased significantly (OR 8.368427, 95% CI 2.573145-39.10672, p-value: 0.00012). Associations between clades and clinical outcomes diverged from previously reported data. Conclusions Lineage B.1.621 increased the need for hospitalization and lethality. Our findings, plus the rapidly increasing prevalence in Colombia and other countries, suggest broadly considering it as a Variant of Interest. 
If associated disease severity is confirmed, possible designation as Variant of Concern could be entertained. SARS-CoV-2, an RNA virus from the coronavirus family whose genome contains 29.8 Kb, has emerged as a new viral pathogen that causes COVID-19 respiratory disease. Due to its important transmission capabilities, this virus led to an unprecedented pandemic in human history, officially declared by the World Health Organization (WHO) in March 2020. By July 29, 2021, over 196 million cases have been reported and around 4 million deaths have been documented worldwide (Johns Hopkins Coronavirus Resource Center, 2021). Colombia (estimated 2020 population of 50.3 million) has been especially affected over time, with 4,877,323 cases and 123,781 deaths reported by the submission date of this article. Since the beginning of the pandemic, the country has faced three COVID-19 waves (July-August 2020; January 2021; and April 2021-present) . The latter has been the most aggressive, representing the second-highest worldwide number of daily new cases and deaths since May 2021 (Coronavirus Colombia, 2021) . SARS-CoV-2 infection can have a wide spectrum of clinical outcomes, from asymptomatic infection to severe disease and death. Even though there are well-known sociodemographic and clinical risk factors related to COVID-19 clinical presentation, the influence of the viral mutational profile in infectivity and severity of the disease is yet to be fully elucidated (S.-W. Huang & Wang, 2021; Richardson et al., 2020) . All viruses undergo genomic changes as they spread, but such variations mostly do not imply a structural or functional impact on protein translation (Peacock et al., 2021) . Since the complete sequence publication in December 2019, the SARS-CoV-2 genome has been thoroughly characterized, leading to the description of genes and regions that are important for host recognition and cellular entry, as well as immune response evasion. 
Different nomenclature systems have been proposed. Some are based on the identification of mutation markers, like that of the Global Initiative on Sharing All Influenza Data (GISAID), which defines 8 major clades (S, L, V, G, GH, GR, GV, and GRY); others are based on genetic, epidemiological, and geographical characteristics, such as the Phylogenetic Assignment of Named Global Outbreak Lineages (PANGO) (Elbe & Buckland-Merrett, 2017; Rambaut et al., 2020; Shu & McCauley, 2017). These systems are useful for tracking pandemic viral spread and make it possible to explore a relation between novel genetic variants, lineages or clades, and disease severity, even though by themselves these genetic variations may not suffice to explain viral phenotypic characteristics. As a result of the PANGO system implementation and genomic surveillance programs established by countries around the world, an increasing number of lineages and variants have been described. Despite important efforts and investments made for the continuous sequencing of the SARS-CoV-2 genome, only a small proportion of variants have been recognized as epidemiologically or clinically relevant. These variants, called Variants of Concern (VOCs), Variants of Interest (VOIs), and Variants of High Consequence (VOHs), demand higher interest from governments and public health agencies since they contain changes that modify viral transmissibility, disease severity, and response to therapeutic and diagnostic tools (Janik et al., 2021). It has been hypothesized that these variants are the result of selective pressure due to changes in host immune characteristics as well as the development of new drugs, immunotherapy, and vaccines. The first VOC was the Alpha variant (B.1.1.7 lineage), identified in September 2020 in England (Galloway et al., 2021). Since then, other VOCs and VOIs have been reported, some displaying convergent mutations that confer functional adaptive characteristics on the virus. 
These variants have become predominant and exhibit higher transmissibility and/or a significant impact on immunity and disease severity (Tracking SARS-CoV-2 Variants, 2021) . Few studies have intended to longitudinally study the possible associations between the mentioned lineages, clades, or other classification systems, and clinical outcomes (Hamed et al., 2021; Lamptey et al., 2021; Nakamichi et al., 2021; Young et al., 2021) . This study aims to explore the association between the molecular characterization of SARS-CoV-2 and disease severity on ambulatory and hospitalized patients from two main cities in Colombia during the first year of the pandemic. Informed consent was obtained from eligible patients. Prospective specimens: Confirmed SARS-CoV-2 respiratory tract specimens (nasopharyngeal aspirate or swab), RT-qPCR positive, were collected from patients recruited from two main pandemic epicenters in Colombia, at tertiary-care university hospitals and a molecular diagnostics laboratory. Retrospective specimens: RNA eluate or primary nasopharyngeal swabs/aspirates were obtained from biorepositories at the participating research centers. RT-qPCR negative samples were excluded. Demographic and clinical characteristics were collected in CASPIO (Caspio, Inc. Sunnyvale, California). Viral RNA inactivation and extraction were performed on 0.2 ml aliquots of viral transport medium (primary sample swab specimens), or on 1 mL aliquots in sterile isotonic saline solution (primary aspirate samples). All specimens were heat-inactivated (56 ºC for 30 minutes) and manipulated under BSL level 2 conditions. RNA extraction consisted of cell lysis, followed by bead binding to magnetic rods, RNA binding to beads, washing, and elution, to obtain 0. Qualitative variables were reported as frequencies and percentages. Quantitative variables were reported as means and standard deviations or median and interquartile ranges depending on normality distribution. 
To assess associations between viral genome characteristics (presence or absence of genetic variants, total number of variants per sample, total number of variants discriminated by gene and impact of the variant in the protein, PANGOLIN lineage, and GISAID clade) and death and need for hospitalization, we used the Kruskal-Wallis test or the Chi-square test of independence, as appropriate to the variable type. In a second approach, we used the Classification and Regression Trees algorithm (Breiman et al., 2017) to find the relevant variables associated with death and hospitalization. This algorithm is useful since the number of covariates that can be included in the model is not limited, unlike in more traditional approaches such as logistic regression. We included the following covariates: sex, age, number of comorbidities, asymptomatic status, BMI, and the aforementioned genetic characteristics. The overall significance level was set at 5%. All statistical analyses were performed in R version 4.0.2.

All SARS-CoV-2 genomes were downloaded from SOPHiA™ DDM® bioinformatics software (SOPHiA Genetics Inc.). Fasta files were aligned to the reference genome, NC_045512, using MAFFT v7 (Katoh et al., 2002). Next, a nucleotide substitution model was predicted using jModelTest v2.1.10 (Posada, 2008). We then constructed a maximum likelihood tree with IQ-TREE 2 (Minh et al., 2020) using the GTR + Γ model and 1000 bootstrap replicates. Finally, each genome was assigned a lineage using the PANGOLIN web server (Rambaut et al., 2020). Additionally, the CoVsurver online server (CoVsurver - CoronaVirus Surveillance Server, 2021) was used for GISAID clade assignment.

The present project was approved by the IRB of Universidad del Rosario and of the participating hospital research centers. All international and national bioethical principles and regulations for clinical investigation in human subjects were followed.

Clinical and demographic information was obtained from 997 patients. 
The mean age was 50.6 years; 35% of participants were under 40 years and 33.7% were over 60 years. Sex distribution in the cohort was homogeneous. The ethnic majority accounted for 76.6% of the sample. Patients resided in Bogotá (62.2%) and in Cali (29%) (Table 1). At diagnosis, 90.7% of the patients had symptoms. Outpatients represented 50.8%, 33% were hospitalized, 9.2% received ICU support, and 6.9% died. The most frequent complications were respiratory (29.4%); the most frequent symptoms were cough (54.5%), fatigue (52.4%), and fever (47.1%) (Table 2).

We obtained 763 SARS-CoV-2 sequences. Genomic coverage was >95% in most samples (63.4%) and 75-95% in 10.2% of cases; coverage of 25-75% and <25% occurred in 13.4% and 13% of sequences, respectively. A total of 2,715 single variants were identified: missense variants (54.1%), synonymous variants (37.75%), and Loss-of-Function (LoF) variants (3.6%); the remaining 4.7% included variants in the untranslated (3'UTR and 5'UTR) and intergenic regions, as well as in-frame, InDel, loss-of-start, and loss-of-stop codon variants. Most genetic changes occurred in ORF1ab (63.5%), S (13.3%), and N (6.2%). When adjusted per kb, the highest rates were in ORF8 (357.51 variants/kb), 3'UTR (218.34 variants/kb), and N (183.92 variants/kb); the lowest rates were in ORF1ab (80.97 variants/kb), S (94.19 variants/kb), and M (101.64 variants/kb).

Due to poor genomic coverage, 85 sequences were discarded from phylogenetic analysis. Additionally, 5 samples did not pass the Chi-square test performed by IQ-TREE and were thus excluded. The maximum likelihood tree yielded one major group (658 samples). The remaining 15 samples clustered into 7 minor branches, more closely related to the original strain (Figure 1). We identified 50 PANGO lineages, with B.1 the most prevalent (45.0%), followed by B.1.111 (11.4%) and B.1.1.348 (9.7%). Interestingly, B.1.621, the so-called "Colombian Variant", was found in 7 cases (1.0%) (Table 3). 
Concerning GISAID clades, GH was predominant (48.1%); G, GR, and "Other" clades were found in 24.8%, 19.8%, and 7.1% of cases, respectively. The S and GRY clades were each detected in a single sample.

A higher association with death was found in male patients over 60 years old and with several comorbidities. Multi-organ complications were associated with higher fatality. A distinct relationship was identified between mortality and the level of schooling: the death rate rose in patients with low educational levels (Table 4). Hospitalization risk progressively increased with age. Male sex, lower education level, and most comorbidities were associated with higher hospitalization requirements. Interestingly, a history of current or previous smoking was inversely related to hospitalization rate (Table 5).

G and GR clades predominated in residents of Bogotá and GH in residents of Cali. Associations between clades and comorbidities were identified: GH and "Other" with diabetes mellitus type 2 (DM2), malignancy, and obesity, and clade G with nephropathy. Symptoms such as cough and fatigue were seen more frequently in patients infected with clade G, nasal congestion with clade GR, and respiratory and renal complications with clades grouped as "Other". Need for hospitalization and ICU care was associated with clades G and "Other", while clades GH and GR predominated in outpatients. Sequences classified as "Other" were more frequent in patients over 59 years old (Table 6). GISAID clade composition changed throughout the study time window (Figure 2). Clades G and "Other" were more abundant during the first months. The prevalence of clade GR grew from 4% initially to 26% at the study conclusion. Similarly, clade GH prevalence rose from 21% to 64%. The decision tree results are shown in Figure 3 and Figure 4. 
Despite the great number of genetic variants and lineages identified among the obtained sequences, and consistent with previously published evidence (Peacock et al., 2021), just two lineages were associated with both increased hospitalization and fatality rates. Firstly, B.1.621, recently labeled as a "Colombian lineage", was detected in Colombia on January 11, 2021, by the National Institute of Health (INS) (Laiton-Donato et al., 2021). Interestingly, in the present study we report the detection of this lineage in a sample from September 2020, collected in Bogotá. B.1.621 has disseminated nationally with significant acceleration since March 2021, reaching 26% accumulated prevalence and peaking at 71% prevalence (7-day rolling average) by the end of July 2021, according to data uploaded to GISAID by the Colombian National Genomic Surveillance Program, as aggregated by the online lineage and mutation tracker outbreak.info (Elbe & Buckland-Merrett, 2017; Shu & McCauley, 2017). We share the growing concern over B.1.621, as reflected by its designation as a VOI by the ECDC. Admittedly, prior to our findings there was insufficient real-world, experimental, or model-based evidence on the impact of B.1.621. This lack of evidence probably reflects the very low worldwide prevalence of B.1.621, reportedly less than 0.5% (Elbe & Buckland-Merrett, 2017; Shu & McCauley, 2017). This low prevalence is coupled with the restriction of cases to well-delimited regions in a specific geography, which allows the criteria for VOI consideration to be fulfilled (Janik et al., 2021). Secondly, we report that lineage B.1.1.388 is associated with a higher hospitalization rate and lethality. 
B.1.1.388 was initially and almost exclusively reported in Colombia (more recently also in Ecuador and Spain), prompting PANGO to label it as another "Colombian lineage" (Rambaut et al., 2020). It displays several distinctive substitutions and had not been designated as either a VOI or a VOC by the date of this paper's submission.

A high percentage of patients had symptoms (90.7%), mostly with influenza-like illness. Clinical presentation severity allowed for outpatient management in 50.8% of cases; 33% required hospitalization and 9.2% ICU admission; lethality was 6.9%. Admission rates to ICU coincide with reports in the literature (5%-32%) (Guan et al., 2020). In agreement with others, pulmonary complications predominated (29.4%). Other systems had minor involvement, mainly associated with the multisystemic impact of the disease. As in most series, the conditions most frequently associated with hospitalization or ICU admission were age over 60 years, male sex, hypertension, cardiovascular disease, nephropathy, obesity, and thyroid disease (Guan et al., 2020; Richardson et al., 2020; Wu et al., 2020).

In the bivariate analysis, we found a significant association between mortality and clades G and "Other", possibly because patients over 60 years old and with comorbidities were overrepresented in these two clades, leading to possible bias. Furthermore, in the first phase of the study, a higher proportion of G and GH clades were identified, as opposed to GH and "Other" in the second. This association between mortality and clades G and "Other" was no longer seen in the multivariate analysis. Several studies have explored associations between clinical outcomes and SARS-CoV-2 clades, and findings have been broadly divergent. Hamed et al. found that GH and GR clades were associated with severe/deceased outcomes, while S, G, and GV were associated with mild/asymptomatic cases. Clades L and V showed no statistically significant association (Hamed et al., 2021). 
Young et al. showed clade L/V to have a significant association with severity and a more intense systemic inflammatory response, while clade G was not associated with higher severity or transmissibility (Young et al., 2021). Nakamichi et al. examined hospitalization and mortality due to SARS-CoV-2 infection, designating two clear clades from hierarchical clustering of the sequence variants. Clade 2, predominantly composed of the S clade, showed a trend toward poorer clinical outcomes compared with Clade 1, predominantly constituted by the GH clade (Nakamichi et al., 2021).

Taxonomic classification into clades provides a relatively coarse characterization, possibly lacking sufficient granularity for clinical correlation, because clades comprise lineages that may or may not be designated as VOCs or VOIs. Additionally, clade composition could be modified over time, depending on the identification of new lineages and on the understanding of the impact that previously designated lineages could have on COVID-19 severity. In this sense, the allocation of a VOC or VOI to a specific clade could give the false perception that the clade itself is the variable associated with higher severity or transmissibility, rather than the lineage, which may be what actually relates to a worse clinical outcome. For instance, the Alpha variant belongs to the GRY clade (previously called GR); Beta, Epsilon, and Iota to GH; Gamma, Zeta, Theta, and Lambda to GR; and Delta, Eta, and Kappa to G. Such a wide distribution of VOIs and VOCs hinders the exploration of possible associations (Tracking SARS-CoV-2 Variants, 2021). Finally, these are dynamic lineage groupings prone to reclassification and reallocation.

In this study, we have described the association between SARS-CoV-2 lineages and the rates of patient hospitalization and fatality. Our findings, in the context of those of others, make plausible the consideration of lineage B.1.621 as a VOI. 
In our view, VOI designation of lineage B.1.621 merits consideration due to its fixation and the significant increase in its detection frequency over a relatively short interval, and because of its high detection rate within the protracted third wave of SARS-CoV-2 infection in Colombia, which had the third-largest COVID-19 caseload in Latin America and the twelfth globally. As the disease severity for this lineage is better characterized in further studies, a possible designation as a VOC could be entertained.

This is a cohort study, in contrast to GISAID data and ecological studies. We suggest that making clinical information associated with the sequence data available in public databases such as GISAID would be beneficial, to foster a timelier association of genomic data with clinical variables. This may prompt more expedient consideration of variants for classification as VOCs or VOIs, in turn triggering strict surveillance in terms of public health and other policies related to the management of the pandemic. Our study included patients with follow-up until clinical outcome definitions were met, ensuring the fidelity of the information collected. In addition, the twelve-month temporal coverage sheds light on the evolution of SARS-CoV-2 and the dynamics related to the introduction of the pathogen from other countries. Preanalytical specimen management included a single platform for obtaining sequences and automated library preparation, thus controlling for cross-contamination and operator-dependent error.

This study has limitations. First, it uses a convenience sample, which limits its generalizability. While participants come from the largest and third-largest cities in Colombia, admittedly the major national outbreak epicenters, especially in the first half of 2020, Caribbean coastal populations, in whom most of the emerging B.1.621 variant cases are detected, were excluded. 
Second, we had a 67% rate of successful sequencing: failure was mostly due to low viral content in samples and/or RNA degradation. Third, the study design was ambispective: recall bias may have affected the accuracy of the symptom information provided in retrospective cases. Nonetheless, good agreement with the literature leads us to infer that recall bias is probably minor. Finally, patient recruitment concluded shortly before the third, and as yet most serious, pandemic wave in Colombia, during which B.1.621 lineage detection soared, which explains its relatively low frequency in our cohort.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

B.1 blue, B.1.111 red, B.1.1.348 yellow, B.1.1 green, B.1.153 olive green, B.1.420 pink. Other colors are described in HTML code in Table 1.

Two variables (age and four lineages) were associated with higher mortality. This tree identifies patients with commonalities, classified into four subgroups: A) Patients over 88 years old, with mortality of 53% (2% of the total study sample). B) Patients 71 to 88 years old who presented with any of the following four lineages: B.1.1, B.1.1.388, B.1.523, or B.1.621. In this group, 62% of patients died (1% of the sample). C) Patients between 71 and 88 years old who presented with lineages different from the four described above, and had 21% mortality (14% of the sample). D) Patients under 71 years old, who had 3% mortality (84% of the total sample). 
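The four subgroups in the mortality tree caption can be restated as a small rule-based function. The following is only a sketch of the partition as described in the caption, not the fitted model or the authors' code. The names `TREE_LINEAGES`, `mortality_subgroup`, and `REPORTED_MORTALITY` are hypothetical, and the handling of ages exactly equal to 71 or 88 is an assumption, since the caption does not state which side of each split those values fall on.

```python
# Lineages listed for subgroup B in the mortality tree caption.
TREE_LINEAGES = {"B.1.1", "B.1.1.388", "B.1.523", "B.1.621"}

# Mortality (percent) reported in the caption for each subgroup.
REPORTED_MORTALITY = {"A": 53, "B": 62, "C": 21, "D": 3}


def mortality_subgroup(age: float, lineage: str) -> str:
    """Return the caption's subgroup label (A-D) for one patient.

    Assumption: ages of exactly 71 or 88 are placed in the 71-88 band;
    the caption does not say how boundary values were handled.
    """
    if age > 88:
        return "A"  # over 88 years old, regardless of lineage
    if age >= 71:
        # 71 to 88 years old: split by the four lineages named in the caption
        return "B" if lineage in TREE_LINEAGES else "C"
    return "D"  # under 71 years old, regardless of lineage


if __name__ == "__main__":
    # Illustrative patients only; these are not records from the cohort.
    for age, lineage in [(90, "B.1"), (75, "B.1.621"), (75, "B.1"), (45, "B.1.621")]:
        group = mortality_subgroup(age, lineage)
        print(age, lineage, group, f"{REPORTED_MORTALITY[group]}%")
```

Written this way, the rules make explicit that lineage enters the mortality tree only within the 71 to 88 age band; outside that band the subgroup is determined by age alone.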
The classification and regression tree for hospitalization seeks to identify the determining variables of COVID-19 clinical severity, in terms of the need for hospitalization. Six groups were identified: A) Patients over 59 years old, with an 85% hospitalization rate (36% of the sample). B) Patients 51 to 59 years old with one or more comorbidities, with a 73% hospitalization rate (7% of the sample). C) Patients 51 to 59 years old with no comorbidities, with a substantially lower hospitalization rate (36%) (6% of the sample). D) Patients 38 to 51 years old with any of the following viral lineages: A, B, B.1.1.388, B.1.1.434, B.1.153, B.1.36.10, B.1.411, B.1.471, B.1.558 or B.1.621, with 82% hospitalization

Classification And Regression Trees
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: A descriptive study
Retrieved June 11, 2021
Johns Hopkins Coronavirus Resource Center
Retrieved June 11, 2021
Data, disease and diplomacy: GISAID's innovative contribution to global health
Emergence of SARS-CoV-2 B.1.1.7 Lineage-United States …
China Medical Treatment Expert Group for Covid-19
Global dynamics of SARS-CoV-2 clades and their relation to COVID-19 epidemiology
Clinical features of patients infected with 2019 novel coronavirus in
SARS-CoV-2 Entry Related Viral and Host Genetic Variations: Implications on COVID-19 Severity, Immune Escape, and Infectivity
The Emerging Concern and Interest SARS-CoV-2 Variants
MAFFT: A novel method for rapid multiple sequence alignment based on fast Fourier transform
Characterization of the emerging B.1.621 variant of interest of SARS-CoV-2.
Infection
Genomic and epidemiological characteristics of SARS-CoV-2 in Africa
IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era
Hospitalization and mortality associated with SARS-CoV-2 viral clades in COVID-19
SARS-CoV-2 one year on: Evidence for ongoing viral adaptation
jModelTest: Phylogenetic model averaging
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
Presenting Characteristics, Comorbidities, and Outcomes Among 5700 Patients Hospitalized With COVID-19 in the
SARS-CoV-2 variants of concern and variants under investigation
European Centre for Disease Prevention and Control. Retrieved August 17, 2021
GISAID: Global initiative on sharing all influenza data - from vision to reality
Retrieved August 17, 2021
Risk Factors Associated With Acute Respiratory Distress Syndrome and Death in Patients With Coronavirus Disease
Association of SARS-CoV-2 clades with clinical, inflammatory and virologic outcomes: An observational study

Anosmia 207 (31.6%)