key: cord-1026641-k1rgljmu authors: Daga, S.; Fallerini, C.; Baldassarri, M.; Fava, F.; Valentino, F.; Doddato, G.; Benetti, E.; Furini, S.; Giliberti, A.; Tita, R.; Amitrano, S.; Bruttini, M.; Meloni, I.; Pinto, A. M.; Raimondi, F.; Stella, A.; Biscarini, F.; Picchiotti, N.; Gori, M.; Pinoli, P.; Ceri, S.; Sanarico, M.; Crawley, F. P.; GEN-COVID Multicenter Study; Renieri, A.; Mari, F.; Frullanti, E. title: Employing a Systematic Approach to Biobanking and Analyzing Genetic and Clinical Data for Advancing COVID-19 Research date: 2020-07-24 journal: nan DOI: 10.1101/2020.07.24.20161307 sha: a6031bf55e3d6e5ac386b1bd8107c75c851204e9 doc_id: 1026641 cord_uid: k1rgljmu Within the GEN-COVID Multicenter Study, biospecimens from more than 1,000 SARS-CoV-2-positive individuals have thus far been collected in the GEN-COVID Biobank (GCB). Sample types include whole blood, plasma, serum, leukocytes, and DNA. The GCB links samples to detailed clinical data available in the GEN-COVID Patient Registry (GCPR). It includes hospitalized patients (74.25%), broken down into intubated, treated by CPAP-biPAP, treated with O2 supplementation, and without respiratory support (9.5%, 18.4%, 31.55% and 14.8%, respectively); and non-hospitalized subjects (25.75%), either pauci- or asymptomatic. More than 150 clinical patient-level data fields have been collected and binarized for further statistical analysis according to the organs/systems primarily affected by COVID-19: heart, liver, pancreas, kidney, chemosensors, innate or adaptive immunity, and clotting system. Hierarchical Clustering analysis identified five main clinical categories: i) severe multisystemic failure with either thromboembolic or pancreatic variant; ii) cytokine storm type either severe with liver involvement or moderate; iii) moderate heart type either with or without liver damage; iv) moderate multisystemic involvement either with or without liver damage; v) mild either with or without hyposmia. 
GCB and GCPR are further linked to the GEN-COVID Genetic Data Repository (GCGDR), which includes data from Whole Exome Sequencing and high-density SNP genotyping. The data are available for sharing through the Network for Italian Genomes, within the COVID-19 dedicated section. The study objective is to systematize this comprehensive data collection and start identifying multi-organ involvement in COVID-19, defining genetic parameters for infection susceptibility within the population, and genetically mapping COVID-19 severity and clinical complexity among patients. The GEN-COVID Multicenter Study was designed to collect and systematize biological samples and clinical data across multiple hospitals and health facilities in Italy with the purpose of deriving patient-level phenotypic and genotypic data and the specific intention to make samples and data available to COVID-19 researchers globally. To reach these aims, the project collected and organized high-quality samples and data whose integrity was assured and which could be readily accessed and processed for COVID-19 research using existing interoperability standards and tools. To this end, a GEN-COVID Biobank (GCB) and a GEN-COVID Patient Registry (GCPR) were established utilizing already existing biobanking and patient registry infrastructure. The collection of samples and data is now utilized in the GEN-COVID Multicenter Study for generating Genotyping (GWAS) and Whole Exome Sequencing (WES) results. This study also works collaboratively with other genomic studies on COVID-19. The data resulting from these studies are then stored and made available through the GEN-COVID Genetic Data Repository (GCGDR). All samples and data have also been systematized in accordance with the FAIR (Findability, Accessibility, Interoperability, and Reusability) Data Principles [1] to promote their international availability and use for COVID-19 research. 
The outbreak of coronavirus disease 2019 (COVID-19), the Severe Acute Respiratory Syndrome due to coronavirus SARS-CoV-2, which first appeared in December 2019 in Wuhan, Hubei province of China, resulted in millions of cases worldwide within a few short months and rapidly evolved into a pandemic [2]. The COVID-19 pandemic represents an enormous challenge to the world's healthcare systems. Among the European countries, Italy was the first to experience the epidemic wave of SARS-CoV-2 infection, accompanied by a severe clinical picture and a mortality rate reaching 14%. In Italy, as of July 16th, 2020, there were 243,506 confirmed COVID-19 cases and 34,997 related deaths reported [3]. The disease is characterized by a highly heterogeneous phenotypic response to SARS-CoV-2 infection, with the large majority of infected individuals having only mild or even no symptoms. However, severe cases can rapidly evolve towards critical respiratory distress syndrome and multiple organ failure. The symptoms of COVID-19 range from fever, cough, sore throat, congestion, and fatigue to shortness of breath, hemoptysis, and pneumonia followed by respiratory disorders and septic shock [4]. The healthcare infrastructure is overburdened and the working conditions within healthcare centers are tremendously challenging. Direct patient care is given the highest priority. Focus is concentrated on monitoring infection evolution in terms of the number of new cases and the number of deaths. Disease severity is also an important parameter that is being continually evaluated, with a current focus on patients experiencing serious pulmonary disease and other life-threatening conditions. Although patient care is the first priority, in the public health emergency brought on by the COVID-19 pandemic it is also of the utmost importance to collect, process, and share human biological materials, clinical data, and study outcomes rapidly and with confidence. 
The tool best suited to address this need and accelerate research on COVID-19 is an accessible, high-quality biobank with associated clinical data and the necessary tools to guarantee interoperability with other biobanks and databanks. This paper addresses the main aim of the project: the collection and systematization of human biological materials, clinical data stored in a patient registry, and derived patient-level genetic data. The paper addresses the methods for sample and data collection and the systematization of the samples and data for research purposes. As COVID-19 increasingly reveals itself as a multi-systemic disease, the purpose of this data collection is to include the most relevant clinical variables that identify multi-organ involvement as well as to identify the genetic determinants of virus-host interaction, so as to holistically disclose the effect of COVID-19 on several physiological subsystems. In the present paper, the samples and the complete datasets are then used within the GEN-COVID Multicenter Study for identifying multi-organ involvement in COVID-19, defining genetic parameters for infection susceptibility within the population, and genetically mapping COVID-19 severity and clinical complexity among patients. Going forward, the main challenge will be to define the genetic parameters for infection susceptibility within specific populations in order to be able to genetically map and identify COVID-19 severity and clinical complexity within and across patient groups. (medRxiv preprint, this version posted July 24, 2020, https://doi.org/10.1101/2020.07.24.20161307. The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.) 
The purpose of the GEN-COVID Multicenter Study is to make the best use of the widest possible sets of patient data and genetic material in order to identify potential links between patient genetic variation and clinical variability, patient presentation and disease severity. By exposing the potential links between genetic variability and disease variability, the study aims to contribute to improved patient-level diagnostics, prognosis, and personalized treatment of COVID-19. To achieve this overall aim, the following specific objectives are being pursued: i) to perform sequencing (WES) on 2,000 COVID-19 patients; iii) to associate the host genetic data obtained on 2,000 COVID-19 patients with severity and prognosis; iv) to share phenotypic data and samples across the GEN-COVID consortium platform as well as with cooperating research institutions and national platforms through the GEN-COVID Disease Registry and Biobank; v) to share genetic data through the Network for Italian Genomes (NIG, http://www.nig.cineca.it/, NIG database, http://nigdb.cineca.it) at CINECA, the largest Italian computing center. Planned key deliverables of the project are i) to develop a state-of-the-art Patient Registry and Biobank for COVID-19 clinical research with access for academic and industry partners; ii) to understand the genetic and molecular basis of susceptibility to SARS-CoV-2 infection and of susceptibility to a potentially more severe clinical outcome (prognosis) within 12 months; iii) to understand the genetic profile of patients, contributing to the rapid identification of medicines to be repurposed for personalized therapeutic approaches that demonstrate greater efficacy against the COVID-19 virus. As the initial starting point of this process, the ACE2 gene has already been extensively investigated in the Italian population [5]. 
In order to ensure a collection that could be, as much as possible, comprehensive and representative of the Italian population, hospitals from across Italy, local healthcare units, and departments of preventive medicine were involved in collecting samples and associated patient-level data for the GEN-COVID Multicenter Study. The inclusion criteria for the study are PCR-positive SARS-CoV-2 infection, age ≥ 18 years, and appropriately given informed consent. In addition to sample collection, an extensive questionnaire was used to assess disease severity and collect basic demographic information from each patient (Supplementary Table 1). As of July 16th, 2020, we have collected samples and data from 1,033 individuals (1,021 without family ties and 12 with family relations) positively diagnosed with SARS-CoV-2 who developed a wide range of disease severity, ranging from hospitalized patients with severe COVID-19 disease to asymptomatic individuals. The GEN-COVID registry was designed to guarantee data accuracy and, at the same time, to ensure ease of data entry in order to save clinicians' time and facilitate compliance. The highest data integrity and data privacy standards, with reference to the EU General Data Protection Regulation (GDPR) [6], were also built into the training for personnel. Samples and data were also collected and systematized in order to meet the FAIR Data Principles requirements. The socio-demographic information included sex, age, and ethnicity. 
Information about family history, (pre-existing) chronic conditions, and SARS-CoV-2 related symptoms was also collected through a detailed core clinical questionnaire as previously reported [7]. These clinical data were continually updated as new information appeared regarding COVID-19 (Supplementary Table 1). More than 150 clinical items have been collected and synthesized in a binary mode for each involved organ/system: heart, liver, pancreas, kidney, olfactory/gustatory and lymphoid systems. The collection and organizing methodologies allowed for rapid statistical analysis. Data were handled and stored in accordance with the EU GDPR [6]. Peripheral blood samples in ethylenediaminetetraacetic acid (EDTA)-containing tubes were collected for all subjects. Genomic DNA was centrally isolated from peripheral blood samples using the MagCore® Genomic DNA Whole Blood Kit (Diatech Pharmacogenetics, Jesi, Italy) according to the manufacturer's protocol at the Promoter Center. For all subjects, aliquots of plasma and serum are also available. Whenever possible, leukocytes were isolated from whole blood by density gradient centrifugation, stored in dimethyl sulfoxide (DMSO) solution, and frozen using liquid nitrogen. For the majority of the cohort, swab specimens are also available and stored at the reference hospitals. Genetic data from GWAS and WES were generated for all patients. The generation of such a massive amount of sequencing data required computing resources sufficient to store and analyse large quantities of data. 
For this purpose, GEN-COVID took advantage of the University of Siena's participation in the Network for Italian Genomes (NIG, http://www.nig.cineca.it/, NIG database, http://nigdb.cineca.it/), which collects genome sequencing data from the Italian population. NIG has a specific agreement with CINECA, the largest computing centre in Italy and one of the largest in Europe, for the use of the CINECA facility for the storage and analysis of data. Data upload followed quality and regulatory requirements already in place to ensure adequate uniformity and homogeneity levels. Data were formatted to meet the requirements of the FAIR Data Principles and thus made interoperable with other FAIR omics data and reference databases. A continuous quantitative respiratory score, the PaO2/FiO2 ratio [partial pressure of oxygen/fraction of inspired oxygen (P/F)], was assigned to each patient as an indicator of respiratory involvement. Taking the normal value (>300) as the threshold, we defined four grades of severity for the PaO2/FiO2 ratio: P/F less than or equal to 100, between 101 and 200, between 201 and 300, and greater than 300. A P/F value is not available for non-hospitalized subjects because the test is only performed in hospitalized patients when needed. 
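The four P/F bands described above can be sketched as a small grading function. This is only an illustration: the function name `pf_grade`, the 0 to 3 numbering of the grades, and the handling of non-integer ratios that fall between the stated bands (for example 200.5) are assumptions, not part of the study protocol.

```python
def pf_grade(pao2_mmhg: float, fio2_fraction: float) -> int:
    """Map a PaO2/FiO2 ratio onto the four bands given in the text.

    Grade numbering is hypothetical: 0 = P/F > 300 (normal),
    1 = 201-300, 2 = 101-200, 3 = P/F <= 100.
    """
    if not 0.21 <= fio2_fraction <= 1.0:
        raise ValueError("FiO2 must be a fraction between 0.21 and 1.0")
    ratio = pao2_mmhg / fio2_fraction
    if ratio > 300:
        return 0
    if ratio > 200:  # assumption: values just above 200 join the 201-300 band
        return 1
    if ratio > 100:  # assumption: values just above 100 join the 101-200 band
        return 2
    return 3


# Example: a patient breathing 50% oxygen with a PaO2 of 75 mmHg (P/F = 150).
grade = pf_grade(75.0, 0.5)
```

Non-hospitalized subjects have no P/F value in the registry, so in practice such a function would only be applied where the measurement exists.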
Heart involvement was considered on the basis of one or more of the following abnormal data: a cardiac Troponin T (cTnT) value higher than the reference range (<15 ng/L) (indicative of ischemic disorder), an increase in the N-terminal pro-hormone BNP (NT-proBNP) value (reference value <88 pg/ml for males and <153 pg/ml for females) (indicative of heart failure), and the presence of arrhythmias (indicative of electric disorder). Hepatic involvement was defined on the basis of a clear elevation of the liver enzymes Alanine transaminase (ALT) and Aspartate transaminase (AST) above the gender-specific reference value (for ALT <41 IU/L in males and <31 IU/L in females; for AST <37 IU/L in males and <31 IU/L in females). Pancreatic involvement was considered on the basis of the pancreatic enzymes pancreatic amylase (PA) and lipase (PL) being higher or lower than their specific reference range (13-53 IU/L for PA and 13-60 IU/L for PL). Kidney involvement was defined as a creatinine value higher than the gender-specific reference value (0.7-1.20 mg/dl in males and 0.5-1.10 mg/dl in females). Lymphoid system involvement was designated as Natural killer (NK) cells and/or peripheral CD4+ T cells below the reference value (NK cells >90 cells/µl (mm^3); CD4+ T cells >400 cells/µl (mm^3)). For each patient, a numerical grading of olfactory and gustatory dysfunction was defined through a clinical questionnaire administered by ENT specialists. D-Dimer values > 10X, with or without a low Fibrinogen level, were used to determine involvement of the blood clotting system. Interleukin 6 (IL6), lactate dehydrogenase (LDH) and C-reactive protein (CRP) values above the reference range (<0.5 mg/dl for CRP and 135-225 IU/L in males and 135-214 IU/L in females for LDH) were used to determine involvement of the proinflammatory cytokine system. 
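To make the binarization rules above concrete, the following minimal sketch encodes four of them (heart, liver, pancreas, kidney) using the reference values quoted in the text. The `Labs` record, the field names, the treatment of missing values as "not abnormal", the exact boundary handling, and the reading of the liver rule as "either enzyme elevated" are all assumptions made for illustration; the study's actual procedure also involved clinical reasoning by specialists.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Sex-specific upper reference limits quoted in the text.
UPPER_LIMITS = {
    "alt": {"M": 41.0, "F": 31.0},         # IU/L, reference is below the limit
    "ast": {"M": 37.0, "F": 31.0},         # IU/L, reference is below the limit
    "nt_probnp": {"M": 88.0, "F": 153.0},  # pg/ml, reference is below the limit
    "creatinine": {"M": 1.20, "F": 1.10},  # mg/dl, upper end of reference range
}
CTNT_LIMIT = 15.0  # ng/L, reference is below the limit
PANCREATIC_RANGES = {"amylase": (13.0, 53.0), "lipase": (13.0, 60.0)}  # IU/L


@dataclass
class Labs:
    """Hypothetical per-patient record; None means 'not measured'."""
    sex: str  # "M" or "F"
    ctnt: Optional[float] = None
    nt_probnp: Optional[float] = None
    arrhythmia: bool = False
    alt: Optional[float] = None
    ast: Optional[float] = None
    amylase: Optional[float] = None
    lipase: Optional[float] = None
    creatinine: Optional[float] = None


def _reaches(value: Optional[float], limit: float) -> bool:
    # Abnormal when the value is at or above a "< limit" reference.
    return value is not None and value >= limit


def _outside(value: Optional[float], low: float, high: float) -> bool:
    # Abnormal when the value is below or above an inclusive range.
    return value is not None and not (low <= value <= high)


def binarize(labs: Labs) -> Dict[str, int]:
    """Return 0/1 involvement flags for four organs."""
    heart = (_reaches(labs.ctnt, CTNT_LIMIT)
             or _reaches(labs.nt_probnp, UPPER_LIMITS["nt_probnp"][labs.sex])
             or labs.arrhythmia)
    # Assumption: either enzyme above its limit counts as liver involvement.
    liver = (_reaches(labs.alt, UPPER_LIMITS["alt"][labs.sex])
             or _reaches(labs.ast, UPPER_LIMITS["ast"][labs.sex]))
    pancreas = (_outside(labs.amylase, *PANCREATIC_RANGES["amylase"])
                or _outside(labs.lipase, *PANCREATIC_RANGES["lipase"]))
    kidney = (labs.creatinine is not None
              and labs.creatinine > UPPER_LIMITS["creatinine"][labs.sex])
    return {"heart": int(heart), "liver": int(liver),
            "pancreas": int(pancreas), "kidney": int(kidney)}


# Example with made-up values for one female patient.
flags = binarize(Labs(sex="F", ctnt=20.0, alt=25.0, ast=20.0,
                      amylase=30.0, lipase=70.0, creatinine=0.9))
```

Keeping the thresholds in plain dictionaries makes it easy to audit them against the reference ranges and to extend the same pattern to the remaining systems.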
Whole Exome Sequencing with at least 97% coverage at 20x was performed using the Illumina NovaSeq6000 System (Illumina, San Diego, CA, USA). Library preparation was performed using the Illumina Exome Panel (Illumina) according to the manufacturer's protocol. Library enrichment was tested by qPCR, and the size distribution and concentration were determined using the Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). The NovaSeq6000 System (Illumina) was used for DNA sequencing through 150 bp paired-end reads. Genotyping data on 700,000 genetic markers were obtained on genomic DNA using the Illumina Global Screening Array (Illumina) according to the manufacturer's protocol. The Genome Reference Consortium Human Build 38 (GRCh38) was used as the human reference genome. Quality checks (SNP calling quality, cluster separation, and Mendelian and replication error) were done using GenomeStudio analysis software (Illumina). The software package Plink v1.90 [8] was used to process the 700k SNP-genotyping data and to calculate SNP genotype statistics. Descriptive statistics were calculated to determine the distribution of clinical features by sex, age, and ethnicity. Chi-square tests were used to evaluate the statistical association between the clinical severity of the disease (from no hospitalization to intubation) and the categorical clinical variables: gender, ethnicity, blood group, respiratory severity, taste/smell involvement, heart involvement, liver involvement, pancreas involvement, kidney involvement, lymphoid involvement, cytokines trigger, D-dimer, number of comorbidities. 
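As an illustration of the chi-square test of association mentioned above, the sketch below computes the Pearson chi-square statistic and its degrees of freedom for a contingency table of severity group by a categorical variable. The study performed this analysis in R; this Python version, the function name `chi_square`, and the counts in the example are purely illustrative assumptions and do not reproduce the study's data.

```python
from typing import List, Tuple


def chi_square(table: List[List[float]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows are severity groups 4..0, columns are (male, female).
made_up_counts = [
    [70, 28],
    [120, 70],
    [190, 135],
    [80, 73],
    [107, 160],
]
statistic, dof = chi_square(made_up_counts)
```

The p-value is then obtained from the chi-square distribution with `dof` degrees of freedom, for example with a statistics library; it is omitted here to keep the sketch dependency-free.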
A linear regression model was used to test the statistical association between COVID-19 severity and age. The variability within clinical features and their relative relationships were summarized and described by principal component analysis (PCA). Only numerical variables with a missing rate lower than 50% were selected; these included: hyposmia, neutrophils, CRP, fibrinogen, LDH, D-dimer, number of comorbidities. Missing data were imputed using KNN (k-nearest neighbor) imputation [9], based on Gower distances [10]. After imputation, variables were centered and scaled prior to PCA. Descriptive statistics, chi-square tests, linear regression, and PCA were performed with the R environment for statistical computing [11]. The GEN-COVID Multicenter Study, through a cooperative and carefully curated mode of sample and data collection, has employed rigorous analyses to obtain phenotypic and genotypic data that can now be used to begin to identify host genetic predispositions to COVID-19. A careful methodological approach across a large geographical area was used to develop a biobank (the GCB), a registry (the GCPR), and finally the resulting genetic data collection (the GCGDC). Following the timelines and milestones of the GEN-COVID Multicenter Study (see Figure 1), the study has achieved a COVID-19 biobank, registry, and genetic data collection linked to one another, providing a high degree of confidence in sample and data integrity, and open to the world for COVID-19 research early on in this pandemic. 
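The pre-processing described in the statistical methods (KNN imputation based on Gower distances, then centering and scaling before PCA) can be sketched as follows. The study used R and does not state the package, the value of k, or how neighbor values were aggregated; the choice of k, the use of the neighbor median, and all function names here are assumptions. For purely numerical variables, the Gower distance reduces to the mean of range-normalized absolute differences over the variables observed in both records.

```python
import math
from statistics import mean, median, stdev
from typing import List, Optional

Row = List[Optional[float]]


def gower_distance(a: Row, b: Row, ranges: List[float]) -> float:
    """Gower distance for numeric variables, over jointly observed columns."""
    terms = [abs(x - y) / r if r > 0 else 0.0
             for x, y, r in zip(a, b, ranges)
             if x is not None and y is not None]
    return sum(terms) / len(terms) if terms else math.inf


def knn_impute(data: List[Row], k: int = 5) -> List[List[float]]:
    """Fill each missing value with the median of its k nearest donors.

    Assumes every column has at least one observed value.
    """
    n_cols = len(data[0])
    ranges = []
    for j in range(n_cols):
        observed = [row[j] for row in data if row[j] is not None]
        ranges.append(max(observed) - min(observed))
    completed = [list(row) for row in data]
    for i, row in enumerate(data):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            # Donors are other records in which variable j was observed.
            donors = sorted(
                (gower_distance(row, other, ranges), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            )
            completed[i][j] = median(value for _, value in donors[:k])
    return completed


def standardize(data: List[List[float]]) -> List[List[float]]:
    """Center each column to mean 0 and scale to unit sample standard deviation."""
    columns = list(zip(*data))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, sds)]
            for row in data]


# Example with made-up values (columns could be, e.g., CRP and LDH).
raw = [[1.0, 10.0], [2.0, None], [3.0, 30.0], [2.0, 20.0]]
ready_for_pca = standardize(knn_impute(raw, k=1))
```

The standardized matrix would then be passed to a PCA routine; that step is left out because it normally relies on a linear algebra library.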
The GEN-COVID Biobank (GCB) is a collection of bio-specimens from patients. Collected biological samples include peripheral blood, plasma, serum, primary leukocytes and DNA samples. Samples were stored in a dedicated biobank section while associated clinical data were entered in the related registry. The biobank and registry were organized according to the highest scientific standards, preserving patients' and citizens' privacy, while providing services to the health and scientific community to develop better treatments, test diagnostic tools and advance COVID-19 and coronavirus research. Biobank personnel are responsible for sample pseudonymization, storage, and insertion in the online biobank catalogue. The GEN-COVID Multicenter Study reached a large number of subjects throughout Italy. Tuscany, which is the region in which the study is carried out, currently contributes 22.8% of enrolled patients. The Northern Italian regions, particularly Lombardy and Veneto, currently contribute 52.3% of enrolled patients (Figure 2). This distribution closely reflects the incidence of SARS-CoV-2 infection per 100,000 inhabitants for each Italian region, as of 4 July 2020 [3]. From April 7, 2020 to July 16, 2020, the GEN-COVID Patient Registry (GCPR) collected clinical data from a total of 1,033 Italian SARS-CoV-2 PCR-positive individuals. For each individual, we collected clinical information using standardized clinical schedules (Supplementary Table 1). 
The study protocol also provides access to patients' medical records and continual clinical data updating in order to secure continuity for patient follow-up. The mean age of the entire cohort was 58.7 years (range 18-99). The cohort was predominantly male (57.1%); the mean age of the males was 59.5 years (range 18-99) and that of the females was 57.6 years (range 19-98). Subjects were divided into five qualitative clinical severity categories depending on the need for hospitalization, the respiratory impairment and, consequently, the type of ventilation required: i) hospitalized and intubated (9.5%); ii) hospitalized and treated with CPAP-BiPAP and high-flow oxygen (18.4%); iii) hospitalized and treated with conventional oxygen support only (31.55%); iv) hospitalized without respiratory support (14.8%); v) not hospitalized pauci/asymptomatic individuals (25.75%) (Group 4 to 0 in Table 1). Gender distribution was statistically significantly different among the 5 groups (p-value = 7.81 x 10^-6). In the group with high care intensity (Group 4), 72.4% of subjects were male, while in the group with the milder phenotype (Group 0) 59.8% of subjects were female. The data shown in Figure 3 can be further mined through clinical reasoning and represented as a binary clinical classification for organ/system damage (Table 3). Table 4 shows the prevalence of damage to the different organs/systems in the 5 clinical categories based on respiratory failure. Heart involvement was detected in 55% of cases in Group 4, 39% of subjects in Group 3, 34.1% in Group 2, and 21.6% in Group 1. 
Liver involvement was present in 72.4% of cases in Group 4, 59.3% in Group 3, 46% in Group 2, and 33.7% in Group 1. A statistically significant difference among the 5 groups was found for all organs/systems, except for the lymphoid system. Finally, the dendrogram in Figure 4 shows how COVID-19 phenotypes can be clustered using the clinical data representation reported above. Hierarchical Clustering analysis identified five main clinical categories and several subcategories: A) severe multisystemic, with either thromboembolic (A1) or pancreatic variant (A2); B) cytokine storm, either moderate (B1) or severe with liver involvement (B2); C) mild, either with (C1) or without hyposmia (C2); D) moderate, either without (D1) or with (D2) liver damage; E) heart type, either with (E1) or without (E2) liver damage (Figure 4). WES and Genotype (GWAS) data were generated within the GEN-COVID Genetic Data Repository (GCGDR). In order to be able to store and analyse the massive amount of genomic data (mainly WES with coverage > 97% at 20x, and prospectively also including WGS) generated with the analysis of the entire cohort of samples populating the biobank, we relied on the NIG. External users can upload and analyse data using the NIG pipeline by registering and creating a specific project. A section dedicated to COVID-19 samples has been created within the NIG database (http://nigdb.cineca.it/) that provides variant frequencies as a free tool for both clinicians and researchers. 
[…] had a frequency < 0.01. From the genotype perspective, the average observed heterozygosity was 0.047. The data from high-density (700k) SNP genotyping are also generated on the same cohort and shared with international collaborations, including the COVID-19 Host Genetics Initiative (https://covid-19genehostinitiative.net/) and GoFAIR VODAN [16]. As expected, the majority of subjects in the group with high care intensity (Group 4) were males (72.4%), while in the group with mild phenotype the majority of subjects were females (59.2%). This confirms previously published data reporting a predominance of males among the patients most severely affected by COVID-19 [17]. Among the 767 SARS-CoV-2-positive hospitalized patients, 63% are males and 12.8% required intubation. This is in line with the distribution of the Italian population of hospitalized COVID-19 patients [3], underlining the representativeness of our cohort. Heart involvement was detected in the majority of severe cases (Group 4), again confirming a recent report [18]. Hospitalized SARS-CoV-2-positive patients (Groups 2 to 4) have multiple-organ involvement: in particular, heart, liver, pancreas and kidney. In line with our previous data and with literature findings, this confirms that COVID-19 is a systemic disease rather than just a lung disorder [19;20]. Clinical data may be represented and consequently interpreted in different ways. The simplest way of representation is using the raw data of laboratory/instrumental values. 
In this case, reasoning about which value has to be considered and/or at which time of clinical evolution is necessary in order to have consistency within the cohort. PCA using the worst value at the time of admission showed the expected variability, with hyposmia juxtaposed to the number of comorbidities and thus representing a marker of lower severity. The fibrinogen value is juxtaposed to inflammatory markers, such as CRP (and D-Dimer and LDH), because it is consumed during the prothrombotic state. We can conclude that such raw laboratory values are fairly good for representing the clinical variability of the cohort in classical PCA. A more elaborate way of representing clinical data is to filter the raw laboratory/instrumental values by clinical reasoning, which often requires a face-to-face meeting with organ reference specialists and direct access to the patients' medical records. This representation is given in Table 3, and its distribution against lung dysfunction is synthesized in Table 4. Involvement of relevant organs or systems is represented in binary and is then used for representing COVID-19 as a systemic disorder (Figure 4). We propose this representation as one of the best, being closer to the real complexity of the disease. It should be considered for use in further data mining and correlation with genetic data. The emerging clinical categories from Hierarchical Cluster Analysis point to specific types and subtypes, which are more likely to have common genetic factors. 
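To illustrate how binary organ/system profiles can be grouped by hierarchical clustering, the following minimal sketch performs agglomerative clustering on 0/1 vectors. The text does not state which distance or linkage was used to build the dendrogram in Figure 4; Hamming distance and average linkage are assumptions chosen for simplicity, and the example profiles are made up.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of organ/system flags on which two patients differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(c1: List[int], c2: List[int],
                    profiles: Sequence[Sequence[int]]) -> float:
    """Mean pairwise distance between the members of two clusters."""
    total = sum(hamming(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(profiles: Sequence[Sequence[int]],
                n_clusters: int) -> List[List[int]]:
    """Merge the two closest clusters until n_clusters remain.

    Returns clusters as lists of patient indices.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Made-up profiles; columns could be heart, liver, pancreas, kidney flags.
example_profiles = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
groups = agglomerate(example_profiles, n_clusters=2)
```

Recording the order and distance of each merge, rather than stopping at a fixed number of clusters, is what yields the full dendrogram; cutting it at different heights then gives categories and subcategories.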
As unmasked by our dendrogram (group A), there is indeed a growing body of evidence suggesting that, in addition to the common respiratory symptoms (fever, cough and […] absence of an inflammatory cascade. This would tend to support the hypothesis that SARS-CoV-2 may directly damage myocardial tissue and induce a major cardiovascular event. Thus, as currently recommended, our research reinforces the need to monitor plasma cTnT and NT-proBNP levels in COVID-19 patients. In line with current evidence [28; 29], although liver injury seems to occur more frequently among critically ill patients with COVID-19 (group B), it can also be present in non-critically ill patients (groups D and E) and, as suggested, it could be mostly related to prolonged hospitalization and viral shedding duration. This allows a clinical subclass to be defined for each group according to this organ involvement. A recent extensive review determined the prevalence of chemosensory deficits by pooling together forty-two studies reporting on 23,353 patients [30]. Estimated random prevalence was 38.5% for olfactory dysfunction, 30.4% for taste dysfunction, and 50.2% for overall chemosensory dysfunction. No correlation with age was detected, but anosmia/hypogeusia decreased with disease severity, and ethnicity turned out to play a significant role, since Caucasians have a 3-6 times higher prevalence of chemosensory deficits than East Asians. In accordance with evidence found in the literature, hyposmia was mostly represented among patients in group C with mild clinical symptoms [31]. Similar to the clinical data, large aggregates of genetic data derived from WES may be represented, and consequently interpreted, in different ways. 
After variant calling, the data can be used as such, or variants can be prioritized and filtered according to standard bioinformatics procedures [32], such as damaging-effect predictions, allele frequency in the healthy population, and gene constraint to variation. Alternatively, it is also possible to represent data in a binary mode as follows: i) select missense, splicing, and loss-of-function variants below 1% (rare variants); ii) select missense, splicing, and loss-of-function variants between 1% and 5% (low-frequency variants); iii) select missense, splicing, and loss-of-function variants above 5% (common polymorphisms) in either homozygosity or supposed compound heterozygosity with rare or low-frequency variants. The majority of patients showed about 3% of mutated genes in class i), 5% in class ii) and 28% in class iii) variants (Supplementary Figure 1A). No patients showed variants in more than 8,000 genes (Supplementary Figure 1B). Protein interaction network and pathway analysis have been widely used to uncover and describe genetic relationships in complex diseases, such as cancer [33;34]. For example, over-representation analysis of the biological processes and pathways significantly affected by mutations will be instrumental in empowering the statistical detection of genetic signatures associated with specific COVID-19 phenotypes and in reducing the number of parameters to consider (i.e. dimensionality reduction), with the purpose of developing robust algorithms for prediction of genetic susceptibility to COVID-19 infection and response.
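The binary representation by frequency class described above (classes i to iii) can be sketched as follows. This is only an illustration under stated assumptions: the record fields (`gene`, `consequence`, `af`, `genotype`), the gene names, the handling of frequencies exactly equal to 1% or 5%, and the reading of "supposed compound heterozygosity" as a heterozygous common variant co-occurring with a rare or low-frequency variant in the same gene are all hypothetical choices, not the study's pipeline.

```python
from collections import defaultdict

# Variant consequences retained by all three classes ("lof" = loss of function).
FUNCTIONAL = {"missense", "splicing", "lof"}


def frequency_class(af):
    """Map a population allele frequency to a class label.

    Assumption: exactly 1% and exactly 5% are counted as low frequency.
    """
    if af < 0.01:
        return "rare"            # class i
    if af <= 0.05:
        return "low_frequency"   # class ii
    return "common"              # class iii candidate


def binarize(variants):
    """Return, for one patient, the set of genes flagged 1 in each class."""
    by_gene = defaultdict(list)
    for v in variants:
        if v["consequence"] in FUNCTIONAL:
            by_gene[v["gene"]].append(v)

    flagged = {"rare": set(), "low_frequency": set(), "common": set()}
    for gene, gene_variants in by_gene.items():
        classes = [frequency_class(v["af"]) for v in gene_variants]
        has_rare_or_low = any(c != "common" for c in classes)
        for v, c in zip(gene_variants, classes):
            if c != "common":
                flagged[c].add(gene)
            # Common variants count only if homozygous or (assumed) in
            # compound heterozygosity with a rare/low-frequency variant.
            elif v["genotype"] == "hom" or has_rare_or_low:
                flagged["common"].add(gene)
    return flagged


# Hypothetical variant calls for a single patient.
patient = [
    {"gene": "GENE_A", "consequence": "missense", "af": 0.002, "genotype": "het"},
    {"gene": "GENE_B", "consequence": "lof", "af": 0.03, "genotype": "het"},
    {"gene": "GENE_B", "consequence": "missense", "af": 0.20, "genotype": "het"},
    {"gene": "GENE_C", "consequence": "missense", "af": 0.30, "genotype": "hom"},
    {"gene": "GENE_D", "consequence": "missense", "af": 0.40, "genotype": "het"},
    {"gene": "GENE_E", "consequence": "synonymous", "af": 0.001, "genotype": "het"},
]

result = binarize(patient)
print({k: sorted(v) for k, v in result.items()})
```

Stacking the per-patient sets into a patients-by-genes 0/1 matrix, one matrix per class, gives a representation of the kind that could feed pathway analysis or a classifier.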
Variants, genes, or biological processes will be employed as features to train interpretable, supervised machine learning classifiers (e.g. gradient boosting decision trees [35;36]), which will ease the identification of the genetic factors associated with clinical phenotypes. While data collection is being consolidated and brought to completion according to the study design, we have started to work on a relatively new methodology based on Topological Data Analysis to provide a detailed multidimensional and multiscale exploration of the whole exome data and also to drive a selection of the genes that provide higher predictive power in a machine learning model. The method will be presented, together with the results, in a forthcoming paper. Previous attempts to interpret the genetic bases of complex disorders have failed with very few exceptions, even in those disorders, such as psychiatric disorders, in which (as for COVID-19) twin studies demonstrated a very high rate of heritability. The reason for such a lack of scientific success resides in several weak points: i) the method used to represent the complexity of the phenotype; ii) the procedure employed to represent the huge amount of different genetic data; iii) the absence of a robust mathematical model able to interpret genetic data in non-Mendelian (non-rare) disorders. This paper provides a contribution to the first two points, likely paving the way for a solution to the third.
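To make the idea of an interpretable gradient boosting classifier on binary genetic features concrete, the following is a deliberately simplified sketch. It uses depth-1 trees (stumps) and squared-error loss in plain Python rather than a production library, and the feature matrix, labels, number of rounds, and learning rate are invented for illustration; it is not the model that will be used in the study.

```python
def fit_boosted_stumps(X, y, n_rounds=20, learning_rate=0.5):
    """Gradient boosting with squared-error loss and depth-1 trees (stumps)."""
    n, n_features = len(X), len(X[0])
    base = sum(y) / n                    # initial prediction: mean label
    pred = [base] * n
    stumps = []                          # (feature, step if 0, step if 1)
    importance = [0.0] * n_features      # accumulated squared-error reduction
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        resid = [yi - pi for yi, pi in zip(y, pred)]
        best = None
        for j in range(n_features):
            r0 = [r for row, r in zip(X, resid) if row[j] == 0]
            r1 = [r for row, r in zip(X, resid) if row[j] == 1]
            if not r0 or not r1:
                continue                 # constant feature: no split possible
            m0, m1 = sum(r0) / len(r0), sum(r1) / len(r1)
            # Squared-error reduction of a full (unshrunk) stump fit.
            gain = len(r0) * m0 ** 2 + len(r1) * m1 ** 2
            if best is None or gain > best[0]:
                best = (gain, j, m0, m1)
        if best is None:
            break
        gain, j, m0, m1 = best
        step0, step1 = learning_rate * m0, learning_rate * m1
        stumps.append((j, step0, step1))
        importance[j] += gain
        pred = [p + (step1 if row[j] else step0) for p, row in zip(pred, X)]
    return {"base": base, "stumps": stumps, "importance": importance}


def predict(model, row):
    """Score for one patient: base value plus the contribution of each stump."""
    return model["base"] + sum(s1 if row[j] else s0 for j, s0, s1 in model["stumps"])


# Hypothetical data: rows are patients, columns are binary gene flags,
# y is a binary phenotype (1 = severe) driven here by the first gene only.
X = [
    [1, 0, 1], [1, 1, 0], [1, 0, 0], [1, 1, 1],
    [0, 0, 1], [0, 1, 0], [0, 0, 0], [0, 1, 1],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = fit_boosted_stumps(X, y)
print(model["importance"])
print(predict(model, [1, 0, 0]), predict(model, [0, 1, 1]))
```

The per-feature importance is what makes such a model interpretable: in this toy example all of the error reduction is attributed to the first feature, which is the only one related to the label.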
[...] and the underlying SARS-CoV-2 virus will rely in large part on human biological materials and patient-level data that are comprehensively collected and systematically organized with careful attention to sample and data integrity as well as the FAIR Data Principles. Improving diagnostics, developing existing or new therapeutics, improving treatment protocols, and even developing public health policies rely upon a foundation of evidence that requires the comprehensive, patient, and systematic collection and organization of COVID-19 patient biological samples and data of high integrity, confidence, and interoperability. The GEN-COVID Multicenter Study's GCPR, GCB, and GCGDR present a model that can be further explored as a systematic approach to sample and data collection while also being immediately deployable in our collective fight against COVID-19.
The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted July 24, 2020. medRxiv preprint doi: https://doi.org/10.1101/2020.07.24.20161307
The FAIR Guiding Principles for scientific data management and stewardship
A Novel Coronavirus from Patients with Pneumonia in China
Italian Civil Protection Department - COVID-19 case update
Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention
ACE2 gene variants may underlie interindividual variability and susceptibility to COVID-19 in the Italian population
/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data
Clinical and molecular characterization of COVID-19 hospitalized patients
Second-generation PLINK: rising to the challenge of larger and richer datasets
Missing value estimation methods for DNA microarrays
A general coefficient of similarity and some of its properties
R: A language and environment for statistical computing
Telethon Network of Genetic Biobanks
The COVID-19 Host Genetics Initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic
Sex difference and smoking predisposition in patients with COVID-19. The Lancet Respiratory Medicine
COVID-19: Myocardial injury in survivors
Front Med (Lausanne)
COVID-19 and Liver Dysfunction: Current Insights and Emergent Therapeutic Strategies
Prevalence of Chemosensory Dysfunction in COVID-19 Patients: A Systematic Review and Meta-analysis Reveals Significant Ethnic Differences
Clinical characteristics of asymptomatic and symptomatic patients with mild COVID-19
Settling the score: variant prioritization and Mendelian disease
Pathway and network analysis of cancer genomes
Network propagation: a universal amplifier of genetic associations
An interpretable mortality prediction model for COVID-19 patients

GEN-COVID Multicenter study. Panel A. Main milestones of the study with the timeline for the 22 Italian hospitals [...] Treviso Hospital, Local Health Unit (ULSS) 2 Marca Trevigiana, Treviso [...] USCA Arezzo; 6: Department of Preventive Medicine Senese, Siena; 7: Department of Preventive Medicine Aretino-Casentino-Valtiberina, Arezzo; 8: Department of Preventive Medicine Alta Val d'Elsa, Poggibonsi; 9: Department of Preventive Medicine Amiata Senese e Val d'Orcia - Valdichiana Senese, Montepulciano)