key: cord-0907812-j79zvf08
authors: Balanovsky, O.; Petrushenko, V.; Mirzaev, K.; Abdullaev, S.; Gorin, I.; Chernevskiy, D.; Agdzhoyan, A.; Balanovska, E.; Kryukov, A.; Sychev, D.
title: The variation of genome sites associated with severe COVID-19 across populations: the worldwide and national pattern
date: 2020-11-24
journal: nan
DOI: 10.1101/2020.11.22.20236414
sha: bb2c2fc2b63e068cb5d1dd3c3bbf8ad7db6382a8
doc_id: 907812
cord_uid: j79zvf08

Background. The knowledge of clinically relevant markers distribution might become a useful tool in COVID-19 therapy using personalized approach in the lack of unified recommendations for COVID-19 patients management during pandemic. We aimed to identify the frequencies and distribution patterns of rs11385942 and rs657152 polymorphic markers, associated with severe COVID-19, among populations of the world, as well at the national level within Russia. The study was also dedicated to reveal whether population frequencies of both polymorphic markers are associated with COVID-19 cases, recovery and death rates. Methods. We genotyped 1883 samples from 91 ethnic populations from Russia and neighboring countries by rs11385942 and rs657152 markers. Local populations which were geographically close and genetically similar were pooled into 28 larger groups. In the similar way we compiled a dataset on the other regions of the globe using genotypes extracted or imputed from the available datasets (32 populations worldwide). The differences in alleles frequencies between groups were estimated and the frequency distribution geographic maps have been constructed. We run the correlation analysis of both markers frequencies in various populations with the COVID-19 epidemiological data on the same populations. Findings. The cartographic analysis revealed that distribution of rs11385942 follows the West Eurasian pattern: it is frequent in Europeans, West Asians, and particularly in South Asians but rare or absent in all other parts of the globe. Notably, there is no abrupt changes in frequency across Eurasia but the clinal variation instead. The distribution of rs657152 is more homogeneous. Higher population frequencies of both risk alleles correlated positively with the death rate. For the rs11385942 we can state the tendency only (r=0.13, p=0.65), while for rs657152 the correlation was significantly high (r=0.59, p=0.02). These reasonable correlations were obtained on the Russian dataset, but not on the world dataset. Interpretation. Using epidemiological statistics on Russia and neighboring countries we revealed the evident correlation of the risk alleles frequencies with the death rate from COVID-19. The lack of such correlations at the world level should be attributed to the differences in the ways epidemiological data have been counted in different countries. So that, we believe that genetic differences between populations make small but real contribution into the heterogeneity of the pandemic worldwide. New studies on the correlations between COVID-19 recovery/mortality rates and population's gene pool are urgently needed.

Thus, in our study we focus on the global population variation of both genomic sites, associated with the severe Covid-19.

We analyzed the variation of both SNPs in worldwide populations, based on this study dataset (1883 individuals) from North Eurasian countries and on the database compiled from the published data on other regions of the globe (3088 individuals). We aimed to unravel the patterns of geographic variation of these polymorphisms and investigate whether these genetic factors contributed to the variety of COVID-19 epidemiological parameters between different populations.

In the lack of unified recommendations for COVID-19 patients' management the knowledge of clinically relevant markers distribution may become a useful tool in COVID-19 therapy using personalized approach. Therefore, the initial goal of this study was to identify the frequency of rs11385942 and rs657152 markers among populations at the global level as well as at the national level within Russia. The latter level is important due to the high ethnic heterogeneity of populations from Russia and neighbor states. The second goal was to determine whether there is any connection between the population frequencies of rs11385942 and rs657152 polymorphic markers and COVID-19 cases, recovery and deaths rates.

Generating the dataset on the Russian and neighboring populations.

The samples from populations of Russia and neighbor North Eurasian counties (former USSR and Mongolia) were received from Biobank of North Eurasia [9] . All sample donors gave the written informed consent, approved by the is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint populations; for example in East Siberia we sampled Yakuts and Evenks rather than ethnic Russians who migrated to Siberia within last 1-3 centuries.

In total, we genotyped 1883 samples from 91 ethnic populations from Armenia, Azerbaijan, Belorussia, Georgia, Kazakhstan, Kirgizia, Lithuania, Mongolia, Russia, Tajikistan, Ukraine, and Uzbekistan. We focused on the two markers, rs11385942 and rs657152, associated with the severe Covid-19 [7] . We [11] , [12] , and iii) identifying groups of populations with homogeneity within the groups but pronounced variation between the groups; we omitted some populations which could not stand alone because of the low sample size and could not be pooled because of their genetic peculiarities; the details of this pooling procedure are described in [13] . The pooling resulted in the final dataset of 1785 individuals from 28 pooled populations with the average sample size 128 chromosomes (64 individuals). Table 1 presents the list of these populations ("Russian dataset" section). Their geographic locations are present on the map ( Figure 1 ). is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ;  In addition to the dataset on the Russian populations, we compiled a dataset on the other regions of the globe. The data came from the 16 papers [14] - [29] which studied indigenous populations by using genome-wide Illumina arrays (Illumina700k, Illumina730k, Illumina660k, Illumina650k, Illumina610k,   Illumina550k , and Illumina1M). The resulting merged world dataset included 3336 samples, genotyped by rs657152 and 66 SNPs in the vicinity of rs11385942.

Similar to the Russian dataset we pooled some population samples to achieve larger sample sizes. In most cases we merged samples which came from the same country. Sometimes we have to merge data on the neighbor countries, for example data from Spain and France were merged. Two large countries -China and Indiawere present by 3 merged populations each. The samples from Russia and neighboring counties present in the world dataset were ignored to avoid double counting the same populations/samples. In total, the world dataset included 32 populations with the average sample size 104 chromosomes. Table 1 

One of the two Covid-associated markers, rs657152, was directly genotyped in both, Russian and world datasets. The frequencies of this marker in all populations were counted using PLINK software and are present in the Table 1 .

The second marker, rs11385942, is missed in all genome-wide panels available to date. The initial genome-wide association study on COVID-19 [7] genotyped 712,189 SNPs, then imputed 170 million SNPs, and one out of these imputed SNPs, namely rs113859420, demonstrated the highest association with severe COVID-19. Actually there is a haplotype block (approx. 50 kb length [8] ) in strong LD, and rs113859420 marks this block nearby as effective as any other SNP in the vicinity. In our study, for the Russian dataset we used a few times more dense panel (4,5 million SNPs) than the panel in [7] . Our panel included 250 SNPs in the is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; https://doi.org/10.1101/2020.11.22.20236414 doi: medRxiv preprint vicinity of rs113859420. Imputing was done by the Beagle software [30] by 200 iterations, using 1000 Genomes project dataset as a reference.

As the world dataset was based on the regular (less dense) Illumina panels, including 66 SNPs in the vicinity of rs113859420, we first estimated the imputation quality. For this aim, we compared the genotypes generated for the same samples by using 250 SNPs and 66 SNPs. 1871 out of 1883 genotypes (99,4%) coincided, indicating that regular Illumina panels work well for imputing rs113859420 genotypes. These genotypes were used to count the allele frequencies of rs113859420 in the populations from the world dataset (Table 1) .

The maps of the geographic distribution of the two COVID-associated markers were constructed using the GeneGeo software [31] , [32] . The maps of North Eurasia (Russia and neighboring counties) were created with the generalized Shepard's method, the degree of weight function set to 3 and radius of influence of 1,500 km. The world maps were created by the same method with radius enlarged up to 2,500 km (to fill in gaps between studied populations) and the degree of weight function set to 2 to generate a smoother surface, i.e. trend [33] ).

To provide correlation analysis we searched for the number of cases of COVID-19 per 1 M population, recovery rate per 1 M population and mortality caused by this disease per 1 M population, and mortality per finished COVID-19 cases (last updated on September 18, 2020). Because genetic data in each region described the indigenous populations while recovery/mortality rates were available for the total population of a region, we identified 16 regions of Russia and neighboring countries where indigenous populations constitutes the majority (on average, 85% according to the last census) and run correlation analysis on these 16 groups (Table 3) . For correlation analysis at the world level we used the same 16 is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ;  groups from Russian and neighboring countries as well as all groups from the world dataset except data on Native Americans and Greenlanders (Table 4) .

To estimate the differences in allele frequencies between groups the Fisher's exact test was used using the application of GraphPad InStat software package (GraphPad Software, San Diego, CA, USA); the differences were considered statistically significant at p<0.05 with the Bonferroni correction. To determine the correlation between the alleles frequencies and epidemiological parameters per 1 million the Pearson correlation coefficient was calculated using Statsoft Statistica (Dell Statistica, Tulsa, OK, USA); the differences were considered statistically significant at p<0.05 with the Bonferroni correction.

This single nucleotide insertion was proved to have strongest association with the severe course of SARS-COVID-19 [7] . To unravel its worldwide variation we genotyped 1883 samples from Russian and neighboring countries, and compiled the dataset on 3088 samples from other regions of the globe. The final worldwide dataset contains frequencies of rs11385942_GA in 60 populations (Table 1 ). Figure 2 presents the map of this marker's variation worldwide. The highest frequency (20-30%) was found in South Asia, followed by West Asia and Europe (5-15%). This marker is rare or not detected in East Asia, North Asia (Siberia), Native Americans, and in Sub-Saharan Africa, while present in North Africa. The data on South-East Asia and Papua are scarce but indicate that this insertion might be present there at elevated frequencies. Generally, the worldwide spread of this marker follows the "West Eurasian" pattern, well-known in human populations genetics for many other genetic markers, including mitochondrial DNA and Ychromosomal markers [34] , [35] . The South Asian populations have much stronger genetic affinities with West Eurasians than with East Eurasians. The specific feature of rs11385942 distribution pattern is that its frequency is higher in South is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; https://doi.org/10.1101/2020.11.22.20236414 doi: medRxiv preprint Asian part rather than in West Asian/European part of the West Eurasian gene pool.

This description of a global pattern is inevitable schematic. To compensate this, we coupled the analysis at the worldwide level with the detailed analysis at the national level of the largest country.

Our extensive Russian dataset allows investigate the variation of rs11385942 within the area of Russia and neighboring countries in more details. Frequencies are available in the Table 1 and visualized on the map (Figure 3 ). The map reveals that the difference between higher frequencies in Europe and absence in North Asia (which is obvious on the world map) is not sharp. Instead, there is a pattern of clinal variation, i.e. gradual decrease in frequency throughout 7 thousand kilometers, from the frequencies 13-16% in Ukraine, Belorussia and westernmost Russian populations to the zero frequency in Kamchatka and Chukotka peninsulas at the Pacific coast. Generally, in populations of European Russia the average frequency is 11% while in Siberia it is only 3% (Table 1 ). In Central Asian countries the frequencies are generally low (1-4%). Tajikistan is the notable exception (frequency is 14%, significant difference from neighbor countries) because Tajikistan population is geographically and genetically close to South Asian populations which exhibit the world maximum of rs11385942.

Focusing on details of the frequency distribution, one may note that rs11385942 frequency in the southwestern regions of Russia is quite similar in comparison with bordering Belarussia and Ukraine -the differences were insignificant (Table 1 ). In the Caucasus, the lowest frequency of the risk GA allele variant was found in the central regions (Chechnya, Ingushetia, North Ossetia) -6%, with higher frequency in western and eastern regions (≈ 10%), and particularly in South Caucasus (14%).

. CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; https://doi.org/10.1101/2020.11.22.20236414 doi: medRxiv preprint Figure 4 presents the frequency distribution map of rs657152. In comparison with rs11385942 polymorphic marker, this distribution is more homogeneous.

Generally, it is quite frequent among almost all Old World populations. The highest frequencies (above 50%) were observed in Sub-Saharan Africa, while the most Eurasian populations have frequencies between 40 and 50%. At the periphery of Eurasia (Atlantic fridge of Europe, Far East, South-East Asia) the frequency tends to drop below 40% and in Native Americans and Australasians it is nearby absent.

Among populations of European Russia the rs657153 frequency varied from 38 to 42%, which was comparable with Belorussia (33%, p>0.05) and lower than in Ukraine (51%, p<0.05) (Tables 1, 2). In Caucasus rs657152 was most frequently met in eastern regions (Daghestan, 52%), the lowest percentage was found in western regions (27%). Distribution differences in rs657152 among the population of these regions were most significant in comparison with other included regions (Table 2 ). In Asia, among Tuvans and Mongols, the frequency of rs657152 was 38-39%, while among the populations of Uzbekistan and Kyrgyzstan the frequency was 51% (with the insignificant differences in comparison with Kazakhstan and Tajikistan, Table 2 ).

We compared frequencies of both markers in various populations of Russia with the recovery and mortality rates from COVID-19 in the same populations.

The epidemiological numbers (number of COVID-19 cases, recoveries and death) and population sizes are present in the Table 3 . We have observed (Table 5) is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; https://doi.org/10.1101/2020.11.22.20236414 doi: medRxiv preprint found positive correlations between both risk alleles and the mortality rate, measured as number of deaths per number of Covid-19 cases, and negative correlation between both risk alleles and the recovery rate (number of recoveries normalized by number of cases). The correlation of mortality with the frequency of rs657152 (r=0,63) was significant with p-value 0,01 (Table 5) , while correlation with the frequency of rs11385942 (r=0,24) was insignificant (p=0.38). Correlation between rs657152 and mortality rate remained significant after Bonferroni correction. When analyzed the combined dataset (Russian and world populations), all correlations became insignificant (Table 5 ).

The genome-wide association study by Ellinghaus D. et al. revealed the strongest association signal at locus 3p21.31 comprised six genes (including SLC6A20, LZTFL1, CCR9, FYCO1, CXCR6, XCR1): rs11385942 risk allele GA is associated with a genetic predisposition to COVID-19 mediated acute respiratory failure. The frequency of risk allele was higher among patients who received mechanical ventilation compared to those who received only oxygen supplementation. The most relevant genes to COVID-19 were SLC6A20, LZTFL1 and CXCR6. The gene SLC6A20 encodes sodium-iminoacid transporter 1, which closely interacts with ACE2the "entrance gate" of SARS-CoV-2 [36] , [37] .

LZTFL1 regulates the functions of cilium and intra-flagellar transport in cells [38] .

The CXCR6 gene regulates the specific location of resident memory T cells in different parts of the lung, providing a stable immune response to pathogens in respiratory tract [39] . The carriage of GA risk allele of determined the decreased expression of CXCR6 and the increased expression of SLC6A20 and LZTFL1 genes in human lung cells, which determines the severe COVID-19 [7] . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. The genome site exhibiting the second high association with severe COVID-19, rs657152 is located within the AB0 blood group locus. The importance of AB0 blood groups system for COVID-19 was also reported in other studies, namely the risk of respiratory failure in COVID-19 was highest in patients with blood group A compared to other groups while the lowest risk was in patients with blood group O, [40] , [41] , though these studies are primarily concerned with the risk of infection and not with disease severity [42] . The protective effect of blood group 0, in contrast to other groups, was explained by the presence of neutralizing antibodies against protein-linked N-glycans [7] , [43] . It is also known that there is a linkage between ABO blood group locus and the expression of the von Willebrand factor's gene (VWF) (12p13.31 locus), which in connection with VIII factor promotes the clots formation on the surface of damaged vessels. Pulmonary endothelial cells in non-O groups are associated with higher levels of VWF compared to O group [44] , [45] , which may explain the role of AB0 blood group locus in COVID-19 patients.

However, the rs657152 does not directly encode blood group A or any other blood group. Instead, it can be used to distinguish rare variants [46] . The fact that GWAS identified the strongest association with this SNP which does not code the is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; https://doi.org/10.1101/2020.11.22.20236414 doi: medRxiv preprint blood group while other studies identified association with the classical blood groups indicates the complexity underlying these associations which might be resolved by future studies involving the sequencing of the entire AB0 locus.

Meanwhile, we can consider the frequency distribution of AB0 group A ( Figure 5 ).

The most impressive feature of this map is the large amount of data it is based on: 2757 populations, which is almost thirty times more than the dataset for the other SNPs analyzed in this study. This was because the map ( Figure 5 ) is based on the frequencies of blood groups accumulated in 20 th century due to works of many scholars. This case exemplifies the value of old-fashion datasets, because using modern datasets (like 1000 Genomes Project or the dataset analyzed in our study) one could obtain a rough, approximate distribution map, while using the data from books published in the second half of 20 th century provided the very detailed and reliable picture ( Figure 5 ). This picture demonstrated that highest frequencies (30% and higher) were observed in West Europe, slightly lower in the Volga-Ural region in East Europe, and generally low frequencies (10-20%) in North Asia is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; https://doi.org/10.1101/2020. 11.22.20236414 doi: medRxiv preprint are associated with severe course of disease rather than with the susceptibility.

Second, the correlations with the outcomes of the disease were as they should be: for both markers, higher population frequency of risk alleles correlated positively with the deaths rate. For the rs11385942 we can state the tendency only, while for rs657152 the correlation was high (r=0,6) and significant (p-value 0,01, Table 5 ).

Notably, we found these reasonable correlations on the "Russian" dataset only, while correlations on the world dataset were zero. This difference could be attributed to the pronounced difference between countries how the not-severe cases are counted: the number of underestimated COVID-19 cases affects the mortality rate dramatically, making it impossible to reveal the correlation with genetic factors. On the contrary, when using epidemiological statistics within one country (in our case, Russia) the ways to count COVID-19 cases become more uniform, and the correlation with variation of the risk alleles frequency becomes evident.

The genetic data (allele frequencies) used in our study are also affected by relatively low samples sizes, thus statistical noise is expected to decrease the correlation values. To this end we believe that future studies will identify even stronger correlation between recovery/mortality rates and population's gene pool, and that genetic variation between populations makes small but real contribution into the heterogeneity of the pandemic worldwide. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; https://doi.org/10.1101/2020.11.22.20236414 doi: medRxiv preprint is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; https://doi.org/10.1101/2020.11.22.20236414 doi: medRxiv preprint The used scale is more detailed that than on the world map ( Figure 2 ). The map was modified from [47] . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted November 24, 2020. ; Table 1 . Frequencies of the two genetic markers associated with severe Covid-19 in various populations ("Russian" and "world" datasets).

Note: samples sizes are given in chromosomes (twofold number of individuals). 

WHO Director-General's opening remarks at the media briefing on COVID-19

Clinical Characteristics of 138 Hospitalized Patients With 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China

European Centre for Disease Prevention and Control

The Role of Vitamin D in The Age of COVID-19: A Systematic Review and Meta-Analysis Along with an Ecological Approach

The role of vitamin D in the prevention of coronavirus disease 2019 infection and mortality

Perspective: Vitamin D deficiency and COVID-19 severity -plausibly linked by latitude, ethnicity, impacts on cytokines, ACE2 and thrombosis

Genomewide Association Study of Severe Covid-19 with Respiratory Failure

The major genetic risk factor for severe COVID-19 is inherited from Neanderthals

Population biobanks: organizational models and prospects of application in gene geography and personalized medicine

Second-generation PLINK: rising to the challenge of larger and richer datasets

Principal components analysis corrects for stratification in genome-wide association studies

Population Structure and Eigenanalysis

Application of the population biobank for analysis of the distribution of the clinically significant DNA markers in the Russian populations: bioinformatic aspects

No Evidence from Genome-Wide Data of a Khazar Origin for the Ashkenazi Jews

Population Genetic Structure in Indian Austroasiatic Speakers: The Role of Landscape Barriers and Sex-Specific Admixture

Afghan Hindu Kush: Where Eurasian Sub-Continent Gene Flows Converge

Autosomal and uniparental portraits of the native populations of Sakha (Yakutia): implications for the peopling of Northeast Eurasia

Genomic study of the Ket: a Paleo-Eskimo-related ethnic group with significant ancient North Eurasian ancestry

Genetic evidence for an origin of the Armenians from Bronze Age mixing of multiple populations

Standing at the Gateway to Europe -The Genetic Structure of Western Balkan Populations Based on Autosomal and Haploid Markers

Genetic Heritage of the Balto-Slavic Speaking Populations: A Synthesis of Autosomal, Mitochondrial and Y-Chromosomal Data

Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation

Shared and Unique Components of Human Population Structure and Genome-Wide Signals of Positive Selection in South Asia

Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans

The genetic prehistory of the New World Arctic

Genomic evidence for the Pleistocene and recent population history of Native Americans

Ancient human genome sequence of an extinct Palaeo-Eskimo

The Caucasus as an Asymmetric Semipermeable Barrier to Ancient Human Migrations

The Genetic Legacy of the Expansion of Turkic-Speaking Nomads across Eurasia

A One-Penny Imputed Genome from Next-Generation Reference Panels

Parallel evolution of genes and languages in the Caucasus region

Geoinformation technologies in gene geography

Computer technology of gene geographic studies of the gene pool. III. The isolation of the trend surfaces

Most of the extant mtDNA boundaries in South and Southwest Asia were likely shaped during the initial settlement of Eurasia by anatomically modern humans

Early replacement of West Eurasian male Y chromosomes from the east

Trilogy of ACE2: A peptidase in the renin-angiotensin system, a SARS receptor, and a partner for amino acid transporters

Human intestine luminal ACE2 and amino acid transporter expression increased by ACEinhibitors

A Novel Protein LZTFL1 Regulates Ciliary Trafficking of the BBSome and Smoothened

CXCR6 regulates localization of tissue-resident memory CD8 T cells to the airways

Relationship between the ABO Blood Group and the COVID-19 Susceptibility

Testing the association between blood type and COVID-19 infection, intubation, and death

Characteristics of and Important Lessons From the Coronavirus Disease 2019 (COVID-19) Outbreak in China

Harnessing the natural anti-glycan immune response to limit the transmission of enveloped viruses such as SARS-CoV-2

ABO blood group and von Willebrand factor: biological implications

ABO blood group is a determinant of von Willebrand factor protein levels in human pulmonary endothelial cells

ABO blood group

Russian gene pool" Moscow, Luch Print