key: cord-0829390-1a2jptg3 authors: Prakrithi, P.; Lakra, P.; The Indian Genome Variation Consortium,; Sundar, D.; Kapoor, M.; Mukerji, M.; Gupta, I. title: Genetic risk prediction of COVID-19 susceptibility and severity in the Indian population date: 2021-04-20 journal: nan DOI: 10.1101/2021.04.13.21255447 sha: 1008450b56754b33a1872f66ef8a7196f873abf4 doc_id: 829390 cord_uid: 1a2jptg3
Host genetic variants can determine the susceptibility to COVID-19 infection and severity, as noted in a recent Genome-wide Association Study (GWAS) by Pairo-Castineira et al. 1 . Given the prominent genetic differences among Indian sub-populations as well as the differential prevalence of COVID-19, we use the results of that study to compute genetic risk scores in different Indian sub-populations that may predict the severity of COVID-19 outcomes in them. We computed polygenic risk scores (PRSs) in different Indian sub-populations with the top 100 single-nucleotide polymorphisms (SNPs) with a p-value cutoff of 10^-6 derived from the previous GWAS summary statistics 1 . We selected SNPs overlapping with the Indian Genome Variation Consortium (IGVC) data and with similar frequencies in the Indian population. For each population, the median PRS was calculated, and a correlation analysis was performed to test the association of these genetic risk scores with COVID-19 mortality. We found a varying distribution of PRS in Indian sub-populations. Correlation analysis indicates a positive linear association between PRS and COVID-19 deaths. This was not observed with non-risk alleles in Indian sub-populations. Our analyses suggest that Indian sub-populations differ with respect to the genetic risk for developing COVID-19 mediated critical illness. Combining PRSs with other observed risk factors in a Bayesian framework can provide a better prediction model for ascertaining high COVID-19 risk groups. This has potential utility in the design of more effective vaccine disbursal schemes.
All rights reserved. No reuse allowed without permission. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted April 20, 2021. medRxiv preprint doi: https://doi.org/10.1101/2021.04.13.21255447
Susceptibility to immune-mediated diseases and susceptibility to viral infections are both observed to be heritable traits, and both are associated with specific genetic variants, such as rs11385942 and rs10735079 [1] [2] [3] [4] [5] . The GWAS by Pairo-Castineira et al. 1 in critically ill COVID-19 patients from a UK cohort identified strong genetic signals, related to antiviral defence mechanisms and inflammatory organ damage, that are potentially associated with COVID-19 severity. Among the top eight robust associations identified in the GWAS 1 , two SNPs, namely rs10735079 and rs2109069, are also present in the Indian Genome Variation Consortium (IGVC) data. The IGVC was a large-scale, comprehensive study conducted to shed light on the genetic diversity among geographically and ethnically diverse Indian sub-populations 6, 7 . This study identified a high degree of genetic distinctness, with respect to SNPs, among different Indian sub-populations 6, 7 . With the increasing number of COVID-19 cases in India, a populous and genetically diverse country, prioritizing vulnerable populations for COVID-19 vaccination is critical given the limited production of vaccines. Genetic risk estimates associated with COVID-19 susceptibility can help identify susceptible population(s).
GWAS have identified the genetic underpinnings of several diseases, and the associated variants, weighted by their effect sizes, together yield a polygenic risk score (PRS). A PRS provides an estimate of the genetic propensity of an individual to develop a disease and/or a trait 8,9 . Transethnic replication of GWAS effect sizes has been employed previously; however, it is challenging and might not lead to accurate predictions when applied to non-discovery GWAS populations, owing to biological differences between populations, such as different patterns of linkage disequilibrium (LD), allele frequencies, and gene-environment interactions 10 , and/or technical differences. For example, there will be no transethnic replication if there is a significant difference in the LD structure across different ethnic populations 11 . However, it has been shown that training data sets that include samples from both the discovery population in which the GWAS was conducted (conferring the advantage of a large GWAS sample size) and the target population for which the PRS is to be calculated (conferring the advantage of matched ancestry) improve the prediction accuracy of the PRS 12, 13 . Hence, using the causal variants identified in a discovery GWAS that overlap with the target population, instead of SNPs in LD with them, and picking the causal variants that have a similar LD pattern across the discovery and target populations would improve the accuracy of a PRS calculated in the target population from the effect sizes of the corresponding SNPs in the discovery GWAS.
Here, with data on stratified Indian sub-populations in hand, we calculated PRSs with the aim of anticipating which Indian sub-populations are at a higher risk of COVID-19-mediated mortality. Considering the challenges associated with the transferability of effect sizes, we also analyzed the differences in LD patterns, and used the SNPs with similar LD patterns in the discovery population and the Indian population to ensure good prediction accuracy of the PRS. Based on these PRSs, we evaluated population-wise susceptibility, which can be of potential utility in designing more effective vaccine distribution schemes among Indian sub-populations.
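The PRS described above, a sum of risk-allele counts weighted by GWAS effect sizes and summarized as a per-population median, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the dictionary layout, the placeholder SNP `rs_null`, and all effect sizes, p-values and genotypes are assumptions made for illustration. Only the p-value cutoff of 10^-6, the top-100 limit and the use of the median come from the text, and the additional filters on IGVC overlap, allele frequency and LD similarity are omitted.

```python
from statistics import median

def select_risk_snps(summary_stats, p_cutoff=1e-6, top_n=100):
    """Keep SNPs with p below the cutoff, then the top_n most significant."""
    hits = [row for row in summary_stats if row["p"] < p_cutoff]
    return sorted(hits, key=lambda row: row["p"])[:top_n]

def polygenic_risk_score(dosages, betas):
    """Sum of risk-allele dosages (0, 1 or 2) weighted by GWAS effect sizes."""
    return sum(betas[snp] * dosages[snp] for snp in betas if snp in dosages)

def median_prs_by_population(genotypes, population_of, betas):
    """Median individual PRS within each population label."""
    scores = {}
    for individual, dosages in genotypes.items():
        pop = population_of[individual]
        scores.setdefault(pop, []).append(polygenic_risk_score(dosages, betas))
    return {pop: median(values) for pop, values in scores.items()}

# Hypothetical toy inputs: effect sizes, p-values and genotypes are
# placeholders, not values taken from the GWAS or from IGVC.
summary_stats = [
    {"snp": "rs10735079", "beta": 0.25, "p": 1e-8},
    {"snp": "rs2109069", "beta": 0.5, "p": 1e-9},
    {"snp": "rs_null", "beta": 0.125, "p": 0.4},  # fails the p-value cutoff
]
betas = {row["snp"]: row["beta"] for row in select_risk_snps(summary_stats)}

genotypes = {
    "ind1": {"rs10735079": 2, "rs2109069": 1, "rs_null": 2},
    "ind2": {"rs10735079": 0, "rs2109069": 1, "rs_null": 1},
    "ind3": {"rs10735079": 1, "rs2109069": 2, "rs_null": 0},
}
population_of = {"ind1": "POP_A", "ind2": "POP_A", "ind3": "POP_B"}

median_prs = median_prs_by_population(genotypes, population_of, betas)
print(median_prs)  # {'POP_A': 0.75, 'POP_B': 1.25}
```

The sketch assumes individual-level genotypes; if only population allele frequencies were available, the dosage could be replaced by twice the risk-allele frequency, which is a different estimator and is not shown here.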
The polygenic predictors used in the present study were derived from Pairo-Castineira et al. 1 and applied to 25 geographically and ethnically diverse sub-populations of the Indian sub-continent 6, 7 . However, the existing differences in the LD patterns between different [...] Fig. S1), suggesting that the LD structure was maintained between the populations. As shown in Fig. 1a [...] Here, our results indicate that these subtle genetic differences can affect their susceptibility to COVID-19 mediated illness.
We further investigated whether non-risk SNPs could also result in a strong correlation with COVID-19 mediated mortality. For this, we selected 100 random SNPs that were not significantly associated with the trait from the same GWAS, and the analyses were repeated [...]
There are certain limitations of the present study. The GWAS was not conducted directly in individuals from the Indian sub-populations, and the PRSs were based on effect sizes from different ancestral groups with COVID-19 infection. Mortality may also be affected by age, by comorbidities such as diabetes, hypertension and cardiovascular diseases [16] [17] [18] , and by environmental risk factors, all of which can act as confounders. Further, the current study does not model the effect of confounders such as population density, which could also affect COVID-19 spread and mortality. Smaller populations might have less exposure to the virus and can therefore display low mortality despite carrying a high PRS.
For instance, the IEELP4 population (Kandhamal district, population size and density of 7.3 lakhs and 90/km² respectively) has a high PRS, but the number of deaths in this district is low compared to other, larger populations (with population sizes of 10-40 lakhs) with high PRS. Moreover, when only the districts with a similar population size and density were considered (i.e. after removing the outliers), the correlation improved, although not significantly (R=0.43, p=0.092), suggesting that population density could act as a confounding factor. Also, a similar trend was observed in the number of cases over several months in the populations, suggesting that there could be a genetic basis for this trend (Supplementary Fig. S3). Prediction accuracy could be improved by using sequencing data: because IGVC is array data, some of the top causal variants were not represented, which could affect the PRS predictions. Increasing the sample size could also provide better accuracy, since IGVC captured only a few individuals of each ancestral group. In this study, we provide a methodological framework, but no clinical evidence, for predicting Indian sub-populations that could be at a higher risk of developing COVID-19 mediated critical illness. These scores, in conjunction with the commonly noted comorbidities, could provide a good prediction for ascertaining high COVID-19 risk groups. Such accurate identification of vulnerable populations is crucial for the development of effective prevention and vaccination strategies.
Such strategies, applied to populations with defined genetic histories such as those in the Indian subcontinent, can be easily extended to model population-level susceptibility to several other important diseases that strain the public health system in India, and provide a necessary use case justifying national-scale projects such as GenomeIndia.
For all our analyses, we used the summary statistics available at https://genomicc.org/data . The COVID-19 data for the Indian population was retrieved from ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/ , https://www.covid19india.org/ , https://covid19.Assam.gov.in/district/ , https://github.com/covid19india/api . The code used for the study can be accessed at the following GitHub repository: https://github.com/Prakrithi-P/COVID_PRS_IGV .
Genetic mechanisms of critical illness in COVID-19
Human Genetic Determinants of Viral Diseases
Host genetics and infectious disease: new tools, insights and translational opportunities
Trans-ethnic analysis reveals genetic and non-genetic associations with COVID-19 susceptibility and severity
Indian Genome Variation Consortium. The Indian Genome Variation database (IGVdb): a project overview
Genetic landscape of the people of India: a canvas for disease gene exploration
Polygenic risk scores: from research tools to clinical instruments
Developing and evaluating polygenic risk prediction models for stratified disease prevention
Tread Lightly Interpreting Polygenic Tests of Selection
Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations
Trans-ethnic genome-wide association studies: advantages and challenges of mapping in diverse populations
Multi-ethnic polygenic risk scores improve risk prediction in diverse populations
varLD: a program for quantifying variation in linkage disequilibrium patterns between populations
The 1000 Genomes Project Consortium. A global reference for human genetic variation
Risks of and risk factors for COVID-19 disease in people with diabetes: a cohort study of the total population of Scotland
Diabetes is a risk factor for the progression and prognosis of COVID-19
The authors declare that they have no competing interests.
Supplementary Fig. 1 Standardized varLD score across CEU and ITU populations. varLD scores for the SNPs analyzed in this study are marked in red, and the majority of these are located in low-varLD regions, reflecting small differences in LD with respect to these SNPs between the two populations.
A similar pattern was observed for the few SNPs whose effect sizes were derived from East-Asian and African ancestral populations, with CHB vs ITU and YRI vs ITU respectively.
Supplementary Fig. 2 Correlation between COVID-19 mediated deaths and polygenic risk score calculated from non-risk SNPs. The p-value histogram shows a uniform distribution.
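The non-risk control summarized in Supplementary Fig. 2 combines two steps: drawing random SNPs that are not associated with the trait, and correlating the resulting scores with deaths. The Python sketch below illustrates both under stated assumptions. The text does not say which threshold defined "not significantly associated" or which test produced the correlation p-values, so `null_cutoff=0.05` is a hypothetical choice and the permutation p-value is a stand-in for whatever test the authors used. All data are toy values, not results from the study.

```python
import random
from math import sqrt

def sample_non_risk_snps(summary_stats, null_cutoff=0.05, k=100, seed=0):
    """Randomly draw k SNPs whose GWAS p-value exceeds null_cutoff."""
    pool = [row for row in summary_stats if row["p"] > null_cutoff]
    return random.Random(seed).sample(pool, k)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: share of shuffles with |r| at least as large."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy summary statistics with uniformly distributed p-values.
rng = random.Random(1)
summary_stats = [{"snp": f"snp{i}", "p": rng.random()} for i in range(500)]
control_snps = sample_non_risk_snps(summary_stats, k=100)

# Hypothetical per-population median PRS and death counts.
median_prs = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1]
deaths = [12, 30, 28, 55, 60, 90]
r = pearson_r(median_prs, deaths)
p = permutation_p_value(median_prs, deaths)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

Repeating the draw with different seeds, recomputing the score from each control set, and collecting the p-values would give the kind of histogram the caption refers to; a roughly uniform histogram is what one expects when the control SNPs carry no signal.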