key: cord-0793727-or0rfxuu
authors: Rana, Vishal; Chien, Eli; Peng, Jianhao; Milenkovic, Olgica
title: How fast does the SARS-Cov-2 virus really mutate in heterogeneous populations?
date: 2020-04-27
journal: nan
DOI: 10.1101/2020.04.23.20076075
sha: 60d96aedce44dc4e1d394dc926646b3099309b19
doc_id: 793727
cord_uid: or0rfxuu

We introduce the problem of determining the mutational support of genes in the SARS-Cov-2 virus and estimating the distribution of mutations within different genes using small sample sizes that do not allow for accurate maximum likelihood estimation. The mutational support refers to the unknown number of sites mutated across all strains and individual samples of the SARS-Cov-2 genome. Given the high cost and limited availability of real-time polymerase chain reaction (RT-PCR) test kits, only a small number of genomic samples (on the order of thousands) is available, especially in the early stages of an outbreak, and such sample sizes do not allow for determining the exact degree of mutation in an RNA virus that comprises roughly 30,000 nucleotides. Nevertheless, working with small sample sets is required in order to quickly predict the mutation rate of this and other viruses and to gain insight into their transformational power. Furthermore, with the small number of samples available, it is hard to estimate the mutational landscape across different age/gender groups and geographical locations, which may be of great importance in assessing different risk categories and factors influencing susceptibility to infection. To this end, we use our state-of-the-art polynomial estimator techniques and the Good-Turing estimator to obtain estimates based on only roughly 1,000 samples per category. Our analysis reveals an interesting finding: the mutational support appears to be statistically more significant in patients who appear to have lower infection rates and handle the exposure with milder symptoms, such as women and people of relatively young age (≤ 55).
Viruses tend to mutate rapidly for a number of reasons, including highly unreliable replication of their genetic content and the need to evolve, adapt and compete with the host organism. The rate of mutation varies widely across types of viruses and has been extensively studied in the past [1], [2]. It is known that RNA viruses tend to mutate faster than DNA viruses, as RNA replication is much less accurate than DNA replication. Similarly, single-stranded viruses are more likely to mutate than double-stranded ones [3] due to their structural instabilities. There is also evidence that the length of the viral genome is inversely correlated with the mutation rate, with shorter viruses mutating faster than those having longer strands of genomic material [4]. Mutational and fitness landscapes of viruses are frequently used to determine their evolvability and potential to spread across diverse populations [5], [6], [7]. If the immune system of a host encounters viral protein from a strain it was already exposed to, its response is fast and the infected cells are efficiently eliminated. If the virus mutates at a very high rate, the host immune system may take longer to respond, giving the virus more time to replicate and spread. This phenomenon is known as antigenic drift [8]. It is hence widely believed that fast-mutating viruses pose a greater health risk, as mutations provide an escape mechanism not countered by the host. Nevertheless, some recent studies have shown that high mutation rates could also trigger a rapid innate immune response in the host; they can also be a sign that the host is successfully fending off the infection and that the virus has to explore a significant number of changes to its genome to successfully compete with the immune system. Elevated levels of mutations can be disadvantageous to the survival of the virus, at least at short time scales [9].
It hence remains an open problem to determine the exact causes for elevated mutation rates in some viruses and their correlation to clinical patient outcomes. Although the sources of viral mutations are not yet fully known, a large body of work reports mutational rates of viruses as indicators of their virulence and potential to cause epidemic and pandemic outbreaks [10, 11]. Almost exclusively, the estimates are based on simple counts of mutations in sequenced genomes, using a reference retrieved either from Patient 0 (the first infected individual) or, more frequently, from Patient 1 (the first sequenced individual). Given the very limited number of samples (i.e., sequenced genomes) compared to the length of the genomes (ranging from several thousand to 100,000 nucleotides), it is apparent that naive counts and the corresponding maximum likelihood estimators are inadequate for this purpose. This small-sample effect is well known and extensively studied in the machine learning community [12, 13]. The distribution of the mutations and the mutational support have not been studied from an estimation perspective. In this work, we present methods for determining the support of mutations and their distribution given sequencing data from a limited number of patients. The problem of mutational support and distribution is of independent interest for future outbreaks as well: it is important to be able to produce a mutational landscape from the limited number of samples available during the very early phases of an outbreak.

Figure 1: Organization of the SARS-Cov-2 genome, taken from Wikipedia.

Organization of the SARS-Cov-2 genome. A breakdown of the genomic structure of SARS-Cov-2 is shown in Figure 1 and described in detail in [14]. Typically, coronaviruses have genomes including at least six open reading frames (ORFs).
ORF1a and ORF1b constitute the longest component of the genome and are responsible for encoding two polypeptides, pp1a and pp1ab, which are jointly used to create a family of nsp proteins. The family includes replicase-transcriptase proteins, which promote cellular mRNA degradation and block translation in host cells, thereby impairing the operation of the immune response; proofreading and scaffolding proteins; processivity clamps; and transmembrane proteins. The pp1a/b polypeptides are functionally combined using proteases, such as the native chymotrypsin-like protease. Viral structural proteins are encoded by the sgRNA region of CoVs, and include the spike (S), membrane (M), envelope (E), and nucleocapsid (N) proteins, as well as proteins encoded by ORF10. For RT-PCR testing and detection of Covid-19, the oligonucleotide primers and probes used should be selected from the nucleocapsid (N) gene region, per recommendation by the CDC and as provided in panels produced by IDT [15]. The latter panel is designed for specific detection of the 2019-nCoV (two primer/probe sets). Additional primer/probe sets, such as one for the human RNase P gene (RP), are included in the panel as controls. It is hence of special interest to focus on the N region of the genome, as high-rate mutations in this region may cause highly undesirable false negatives in the test outcomes.

Data acquisition. For our analyses, we used data from the GISAID EpiCoV [16] database, which contains sequenced viral strains collected from patients across the world. We downloaded the data on several occasions, starting on 04-03-2020 and continuing on 04-10-2020 and 04-14-2020. The data size grew significantly during this time span, and in order to observe the influence of the sample set size on the estimates of the mutational support we used sample sets of different sizes.
We filtered the genomic datasets to include only nearly complete samples, i.e., those of length > 29,000 nts, resulting in a total of 3511 samples on 04-03-2020, 5650 samples on 04-10-2020 and 8893 samples on 04-14-2020. We also downloaded the associated metadata. As the first step in our analysis, we used the sequence alignment software MUSCLE [17] to perform pairwise alignment of each sample with the SARS-Cov-2 sequence of Patient 1, published under the name Wuhan-Hu-1 and collected from a patient admitted to the Central Hospital of Wuhan on December 26, 2019 (GenBank accession number MN908947). Next, for each aligned pair we generate a "mutation profile", a list of the positions in the reference genome at which the patient sample differs from the reference. We do not perform multiple sequence alignment to assess the mutation landscape, as we need to analyze the data of each patient separately. The mutation profile lists are subsequently aggregated over all patient samples, resulting in a mutation histogram accounting for all positions in the viral reference genome. The aggregate profiles are then partitioned according to the 11 genes of the viral genome, depicted in Figure 1, in which the positions are located. The total count of mutations for each location in each gene is used as a sufficient statistic for estimating the mutational support and the distribution of the mutations in each of the 11 genes. To adjust for alignment artifacts introduced by sequencing errors, dropouts and alignment gaps, we removed all gaps encountered in the prefixes and suffixes, as well as sufficiently long gaps (> 10 nts) within the alignments. Most gaps are encountered in the 5'UTR and 3'UTR regions of the genome, as expected from the outputs of global alignment algorithms.
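The profile-and-histogram step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the gene coordinates in the usage note, and the choice to skip insertions relative to the reference (which have no reference coordinate) are all hypothetical. The inputs are assumed to be two equal-length gapped strings produced by a pairwise aligner, with `-` marking gaps.

```python
from collections import Counter


def mutation_profile(ref_aln, sample_aln, max_gap=10):
    """Return 1-based reference positions at which the sample differs.

    Leading/trailing gaps in the sample are ignored, and internal gap runs
    longer than `max_gap` are dropped, mirroring the filtering in the text.
    Insertions relative to the reference are skipped (a simplification).
    """
    ref_aln, sample_aln = ref_aln.upper(), sample_aln.upper()
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    covered = [i for i, c in enumerate(sample_aln) if c != "-"]
    if not covered:
        return []
    first, last = covered[0], covered[-1]

    profile, gap_run, ref_pos = [], [], 0
    for i, (r, s) in enumerate(zip(ref_aln, sample_aln)):
        if r == "-":
            continue  # insertion in the sample: no reference coordinate
        ref_pos += 1
        if i < first or i > last:
            continue  # prefix/suffix gap of the sample
        if s == "-":
            gap_run.append(ref_pos)  # extend the current internal gap run
            continue
        if gap_run:  # a gap run just ended: keep it only if it is short
            if len(gap_run) <= max_gap:
                profile.extend(gap_run)
            gap_run = []
        if s != r:
            profile.append(ref_pos)  # substitution
    if gap_run and len(gap_run) <= max_gap:
        profile.extend(gap_run)
    return profile


def mutation_histogram(profiles):
    """Aggregate per-patient profiles into counts per reference position."""
    hist = Counter()
    for profile in profiles:
        hist.update(profile)
    return hist


def split_by_gene(hist, genes):
    """Partition a histogram by gene; `genes` maps name -> (start, end), 1-based inclusive."""
    return {
        name: {pos: c for pos, c in hist.items() if start <= pos <= end}
        for name, (start, end) in genes.items()
    }
```

As a usage example with toy sequences, `mutation_profile("ACGTACGTACGT", "ACGTAC-TACGA")` reports reference positions 7 (a one-nucleotide gap) and 12 (a substitution). Passing the resulting histogram to `split_by_gene` with made-up coordinates such as `{"geneA": (1, 6), "geneB": (7, 12)}` assigns both positions to `geneB`.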
As there exists a large body of evidence of stratified susceptibility and severity of symptoms across different racial, age and gender groups, we perform four types of mutational support and distribution estimates. We divided the patient mutations based on gender (Male/Female), based on age (under 55/over 55), based on geographical location (Asia/North America/Europe), and based on a combination of features for which sufficiently many samples are available (Male/Female, Below55/Above55, Europe). Since the number of samples per feature type may vary significantly, we performed two tests. In one test we used all samples available, while in the other we adjusted for the difference in set sizes by subsampling the larger of the two classes to make the sample set sizes equal. The number of samples available for the various classes is listed in Table 1. For data obtained on 04-03-2020, we used all the samples available for all the classes, without balancing the class sizes. For data from 04-10-2020 and 04-14-2020, we balanced the classes by subsampling from the larger of the two classes for both age and gender. For geographical regions, we used all 615 samples from Asia and subsampled Europe and North America to 1000 samples each for 04-10-2020. Similarly, we used all 636 samples from Asia and subsampled Europe and North America to 1774 samples each for 04-14-2020 to account for differences in class sizes. It is important to point out that by performing the experiments with different sample set sizes one can compare the quality of the estimates in the early stages of an epidemic with that in later stages, when more information about individual strains of the virus becomes available. Furthermore, the methods outlined in this work apply to any other viral or bacterial dataset collection, with the obvious modifications in place to account for the genetic profile of the microorganisms.

Estimation methods.
The most commonly used techniques for support and distribution estimation are maximum likelihood (ML) methods, which may be seen as methods that directly employ the empirical counts of the symbols to determine the quantities of interest. It is well known that ML approaches perform poorly for large alphabet sizes (supports of the distribution) when only a small number of samples from the distribution is available, as they fail to take into account the fact that some symbols have never been observed due to limited data. The problem of estimating the support of an unknown probability distribution, or estimating the distribution itself, in the context of small sample sets has a long history. The first line of work in this area is attributed to Laplace, as described in [18], who introduced a class of smoothed distribution estimators termed add-1 estimators (or, more generally, add-constant-c estimators). These estimators adjust the counts of symbols in order to account for the unseen. The support of a discrete probability distribution is the number of symbols with positive probability. We define the mutational support of a virus as the total number of genomic sites mutated in any viral genome in any individual (observed or unobserved due to limited testing) compared to a reference genome, which in this case is the Patient 1 genome, the first sequenced SARS-Cov-2 genome. Our postulate is that the mutational support provides a good assessment of the overall number of mutations encountered in a virus and its strains throughout an epidemic/pandemic outbreak. Other types/definitions of mutation rates for SARS-Cov-2 have been widely reported in the literature. What is referred to as the genomic mutation rate is the product of the per-nucleotide site mutation rate and the genome size, and it represents the average number of mutations each offspring will have compared to the parental (or ancestral) genome. RNA viruses have a per-site mutation rate that lies in the range $10^{-6}$ to $10^{-4}$ [19].
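As a quick arithmetic illustration of the genomic mutation rate defined above (per-site rate times genome size), the following sketch plugs in the quoted per-site range and the approximate genome length of 30,000 nucleotides mentioned in the abstract. The function name is hypothetical, and the result is expressed in whatever unit the per-site rate uses; the sketch does not attempt to convert it into mutations per month.

```python
GENOME_LENGTH_NT = 30_000  # approximate SARS-Cov-2 genome size quoted in the text


def genomic_mutation_rate(per_site_rate, genome_length=GENOME_LENGTH_NT):
    """Expected number of mutations per genome copy: per-site rate x genome size."""
    return per_site_rate * genome_length


# Lower and upper ends of the per-site range quoted for RNA viruses.
low = genomic_mutation_rate(1e-6)
high = genomic_mutation_rate(1e-4)
print(f"expected mutations per genome copy: between {low:.2f} and {high:.2f}")
```

Under these assumptions the genomic rate spans two orders of magnitude, from about 0.03 to about 3 expected mutations per genome copy.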
The mutation rate of a virus is often equated with the rate at which errors are made during replication of the viral genome. Determining the genomic mutation rate in a large carrier population is a challenging task, as each host will have a different mutation rate and distinguishing offspring from ancestors is hard. Furthermore, replication errors may not be the only mechanism behind viral mutations. The genomic mutation rate for SARS-Cov-2 is estimated at roughly 2 to 3 mutations a month. We argue that the mutational support more accurately reflects the mutational power of a virus than the mutation rate based on RNA replication analysis alone, as it is obtained through a small-sample statistical analysis of a cohort of hosts. The underlying statistical approaches and methods are designed to account for unsequenced, and hence unseen, mutations and genomes. To estimate the mutational support given small sample sets, we use the polynomial estimators in [20], based on the method first described in [21]. For completeness, the regularized weighted Chebyshev estimators are described below. Let $P = (p_1, p_2, \ldots)$ be a discrete distribution over some finite alphabet and let $x^n$ be a vector of $n$ i.i.d. samples drawn according to the distribution $P$. The problem of interest is to estimate the support size, defined as $S(P) = \sum_i \mathbb{1}_{\{p_i > 0\}}$. We use $S$ instead of $S(P)$ to avoid notational clutter. An important assumption used in all estimation methods is that the minimum non-zero probability of the distribution $P$ is at least $\frac{1}{k}$ for some $k \in \mathbb{R}^+$, i.e., $\inf\{p \in P \mid p > 0\} \geq \frac{1}{k}$. We let $D_k$ denote the space of all probability distributions satisfying $\inf\{p \in P \mid p > 0\} \geq \frac{1}{k}$. A sufficient statistic for $x^n$ is the empirical distribution (i.e., histogram) $n = (n_1, n_2, \ldots)$, where $n_i = \sum_{j=1}^{n} \mathbb{1}_{\{x_j = i\}}$ and $\mathbb{1}_A$ stands for the indicator function of the event $A$.
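To make the histogram $n_i$ and the small-sample problem concrete, the sketch below computes the empirical histogram and the naive plug-in (ML) support estimate, i.e., the number of distinct symbols observed. The simulation parameters (a uniform distribution over 1,000 symbols, 500 draws, a fixed seed) are illustrative assumptions and are not taken from the paper.

```python
import random
from collections import Counter


def empirical_histogram(samples):
    """n_i: the number of times each symbol i appears in the sample."""
    return Counter(samples)


def plugin_support(samples):
    """Naive (ML) support estimate: the number of distinct observed symbols."""
    return len(empirical_histogram(samples))


def simulate_plugin_support(true_support=1000, n=500, seed=0):
    """Draw n i.i.d. uniform samples over `true_support` symbols and apply the plug-in estimate."""
    rng = random.Random(seed)
    draws = [rng.randrange(true_support) for _ in range(n)]
    return plugin_support(draws)


if __name__ == "__main__":
    estimate = simulate_plugin_support()
    print(f"plug-in estimate from 500 draws: {estimate} (true support: 1000)")
```

Because the plug-in estimate can never exceed the number of samples, it necessarily falls short of the true support whenever fewer samples than symbols are available, which is the regime the estimators discussed here are designed for.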
To determine the quality of the estimator, we use the minmax risk under normalized squared loss $R^*(k, n)$, defined as

$$R^*(k, n) = \inf_{\hat{S}} \sup_{P \in D_k} \mathbb{E}\left[\left(\frac{\hat{S}(x^n) - S(P)}{k}\right)^2\right].$$

We seek a support estimator $\hat{S}$ that minimizes

$$\sup_{P \in D_k} \left[ \frac{1}{k^2}\left(\mathbb{E}[\hat{S}] - S\right)^2 + \frac{1}{k^2}\,\mathrm{Var}(\hat{S}) \right].$$

The first term within the supremum captures the expected bias of the estimator $\hat{S}$. The second term represents the variance of the estimator $\hat{S}$. A "good" estimator is required to balance out the worst-case contributions of the bias and variance. The Chebyshev polynomial of the first kind of degree $L$ is defined as $T_L(x) = \cos(L \arccos x)$ [polynomial expansion missing in the source text]. The coefficients in the above expansion equal [equation missing in the source text]. The estimator proposed in [21] takes the form $\tilde{S} = \sum_i \tilde{g}_L(N_i)$, where $N_i$ denotes the count of symbol $i$ and $\tilde{g}_L$ is given by [equation missing in the source text]. By introducing a regularization term and an exponential weighting factor, this estimator can be significantly improved in practice, as documented in [20]. Since the estimator formulation is nontrivial, we omit its full description and refer the interested reader to the previously cited work. We only remark that the estimator termed RWC (regularized weighted Chebyshev) optimizes the regularized risk described above, while the RWC-S estimator uses a risk objective which involves a different normalization term.

By far the most frequently used method for distribution estimation is the Good-Turing estimator [12], which in a slightly modified form may be described as follows. For a sequence $x^n$ of length $n$ over an unknown finite alphabet, we let $n_i$ denote the number of times a symbol $i$ appears in $x^n$. Furthermore, we let $\phi_t$ stand for the count of counts, i.e., the number of symbols that appear $t$ times in $x^n$. The estimator proposed in [13] combines the Good-Turing and ML estimators, the latter being used for the frequently observed symbols. For symbols that appear $t$ times, if $\phi_{t+1} \geq \Omega(t)$, then the Good-Turing estimate is used to determine the underlying total probability mass; otherwise, the ML estimator is used instead. More precisely, for a symbol appearing $t$ times, if $\phi_{t+1} \geq t$ we use the Good-Turing estimator; otherwise we use the empirical estimator.
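The switching rule between the Good-Turing and empirical estimators can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: it follows the smoothed variant in the competitive estimator of [13], where $\phi_{t+1}$ is replaced by $\phi_{t+1} + 1$ in the Good-Turing branch, it folds the division by the sample size into a single final normalization, and it assigns probabilities only to symbols that were actually observed. The function name is hypothetical.

```python
from collections import Counter


def good_turing_ml(samples):
    """Combined Good-Turing / empirical distribution estimate over observed symbols."""
    counts = Counter(samples)          # n_i: occurrences of each symbol
    phi = Counter(counts.values())     # phi_t: number of symbols seen exactly t times

    unnormalized = {}
    for symbol, t in counts.items():
        phi_next = phi.get(t + 1, 0)
        if t > phi_next:
            # Frequently observed symbol: keep the empirical (ML) count.
            unnormalized[symbol] = float(t)
        else:
            # Rare symbol: smoothed Good-Turing count (phi_{t+1} + 1)(t + 1) / phi_t.
            unnormalized[symbol] = (phi_next + 1) * (t + 1) / phi[t]

    total = sum(unnormalized.values())  # normalization so the masses sum to one
    return {symbol: value / total for symbol, value in unnormalized.items()}


if __name__ == "__main__":
    # Toy sequence of mutated site labels (hypothetical data).
    sites = ["a"] * 4 + ["b"] * 3 + ["c"] + ["d"] * 2
    for site, prob in sorted(good_turing_ml(sites).items()):
        print(site, round(prob, 4))
```

In this toy input, only the symbol seen once fails the test $t > \phi_{t+1}$, so it is the only one that receives a Good-Turing count; with so few samples the adjustment is coarse, which is why the rule is meant for histograms with many rarely observed symbols.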
If $n_i = t$, then the probability of the symbol $i$ is computed according to

$$\begin{cases} \dfrac{t}{N}, & \text{if } t > \phi_{t+1}, \\[2mm] \dfrac{\phi_{t+1} + 1}{\phi_t} \cdot \dfrac{t+1}{N}, & \text{otherwise,} \end{cases}$$

where $N$ is a normalization term that ensures that the obtained values are probability masses. The term $\phi_{t+1}$ used in the Good-Turing estimator is replaced by $\phi_{t+1} + 1$ so that every symbol has a nonzero probability. A software implementation of Good-Turing estimators is available at: http://crr.ugent.be/papers/A%20Python%20program%20to%20calculate%20the%20Good-Turing%20algorithm.pdf. Modifications of the Good-Turing estimator that take sampling artifacts/errors such as community structures into account may be found in [22, 23].

3 Results

The first set of results pertains to data collected at an earlier stage of the pandemic (04-03-2020), which did not include sufficiently many samples to allow for sample set size leveling; all available samples were therefore used. From Table 2, it is apparent that naive ML methods underestimate the mutational support in the ORF1a and ORF1b genes roughly two-fold and that the mutational support of both genes is roughly 10% of the total gene lengths. Interestingly, both the ML and RWC-S estimators indicate that the mutational support is higher in younger patients, but in this case the results may be explained by the uneven sample set sizes for the two patient categories (909 versus 1,477). Also, the mutational support of the N region is significantly smaller, amounting to roughly 1% of the genome length for both categories. Similar results may be observed when patients are partitioned according to gender, as listed in Table 3. An interesting observation can be made regarding the results for different geographic regions. Despite the fact that the number of available samples from Asia is smaller than that from Europe and North America, the mutational support in the ORF1a region of Asian patients is more than twice as large as that of North America.
A similar result holds true for the case of ORF1b, where the European population has three times more mutations than the North American population. ORF1a encodes replicase polyproteins pp1a and pp1ab, which implies that the replication machinery has undergone significantly more adaptations in Asia than in North America. It has been documented that the ratio of ORF1a- and ORF1b-encoded proteins plays an important role in the replication efficiency of coronaviruses [24]. Tables 5, 6 and 7 list the results analogous to those in Tables 2, 3 and 4, respectively, obtained using larger datasets retrieved on 04-10-2020, which allow for random subsampling that leads to equal sample set sizes for all subpopulations considered. Based on the results of Table 5, one additional week of data collection, amounting to roughly twice the samples, increased the mutational support by 5% for both ORF1a and ORF1b. On the other hand, the additional data samples show that the N region of the SARS-Cov-2 genome exhibited a much more significant increase in mutations than could be predicted from the early small sample sets, amounting to roughly 10% of the genome. This finding is of great significance for Covid-19 and other viral outbreak testing methods, as it indicates that genomic regions used as identifiers for the virus may mutate much faster than predicted based on small preliminary sample sets, and that one may have to change the primers used for testing as the disease progresses. It also suggests that, as different strata of the population exhibit different mutation rates, different primers may have to be used for testing them. The most surprising result is listed in Table 6 and pertains to the ORF1b region. As may be seen, the mutational support in the female population is 1,621 compared to 941 in the male population, which amounts to an 8.4% difference with respect to the length of the open reading frame.
To address this issue further, we performed another test, the results of which are shown in Table 11 and discussed later. Geographic trends are depicted in Table 7 and follow the same trend observed in Table 4. Tables 8, 9 and 10 show the trends of increase for the mutation support with increased sample sizes, which in this case exceed 8,000.

[Table fragment; caption and column headers missing in the source text. The final column is the gene length in nts.]
ORF3a 84 62 38 158 113 71 154 100 63 828
E 37 11 6 66 19 9 65 19 9 228
M 30 29 15 53 49 25 50 44 24 669
ORF6 2 28 5 2 46 8 2 45 7 186
ORF7a 108 38 49 215 66 90 214 65 89 366
ORF8 340 27 19 341 46 26 342 43 28 366
N 53 90 68 93 152 122 85 137 114 1,260
ORF10 10 25 9 18 28 15 17 27 14 117

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. This version posted April 27, 2020. All rights reserved. No reuse allowed without permission.

[Table fragment; caption and column headers missing in the source text. The final column is the gene length in nts.]
S 182 163 142 352 336 269 340 293 243 3,822
ORF3a 91 56 39 174 96 74 168 85 63 828
E 37 12 14 66 21 24 65 21 24 228
M 31 23 17 55 38 28 52 35 27 669
ORF6 3 48 15 3 87 26 3 86 25 186
ORF7a 109 63 51 216 118 94 214 116 93 366
ORF8 340 19 21 335 29 31 339 29 31 366
N 58 72 77 96 121 137 91 108 129 1,260
ORF10 10 26 7 18 48 10 17 48 10 117

Table 8: Support sizes for different age ranges based on 3,047 samples in each group. The data was retrieved from GISAID on 04-14-2020. Note that the entry in the table under ORF8 marked by * corresponds to a rare scenario where our estimators produce a value smaller than that predicted by an ML estimator. This is due to severe sampling artifacts and in this case, one should choose the larger of the two estimates available.

Table 9: Support size differences for males and females based on 2,817 samples for each group.
The data was retrieved from GISAID on 04-14-2020.

[Table fragment; column headers missing in the source text. The final column is the gene length in nts.]
S 405 389 790 716 740 673 3,822
ORF3a 169 140 272 255 262 230 828
E 30 36 47 63 45 61 228
M 67 69 107 119 103 112 669
ORF6 50 40 87 66 84 65 186
ORF7a 68 72 113 106 110 105 366
ORF8 343 342 347 345 347 345 366
N 195 204 338 348 312 327 1,260
ORF10 31 13 33 22 38 21 117

[Table fragment; caption and column headers missing in the source text. The final column is the gene length in nts.]
M 27 33 45 58 40 52 669
ORF6 15 28 25 47 24 48 186
ORF7a 31 23 54 36 52 36 366
ORF8 27 28 45 48 43 46 366
N 110 108 197 199 178 183 1,260
ORF10 27 5 28 7 33 7 117

Table 11 provides results for a finer partition of test samples into two categories, one including males over 55 years of age and the other females below 55 years of age, with both populations sampled from Europe. The first category has been empirically observed to be at higher risk of infection and to exhibit more severe symptoms [25]. The important finding is that the mutational support of ORF1b is almost twice as large in the low-risk population as in the high-risk population. This result may imply that the large mutational support is a result of a highly competitive virus-host interaction which forces the virus to mutate in order to gain an advantage over the host's immune system. Next, we examine the distribution of mutations in the ORF1a, ORF1b and N regions of the SARS-Cov-2 virus obtained using the Good-Turing estimator, once again focusing on different population traits.
As may be seen from Figures 2 and 3, there is a surprisingly small difference in the distribution of the top-20 mutated sites across different age and gender groups, except for a marked difference in the largest probability (in particular, in the N region for populations partitioned according to age, and for populations partitioned according to gender when including the larger sample sets from 04-14-2020). This is especially the case for samples partitioned according to gender, despite the fact that the number of mutations in female subjects in the ORF1b region was close to twice that in male subjects. In addition, the probability of having a mutation at the highest-probability sites is significantly larger in "younger" than in "older" populations. The trend remains the same for different collection dates, as supported by the results in Figures 5 and 6. The situation is completely different when comparing the distributions of mutations across different geographic regions (Figures 4 and 7), where there are significant differences in the distributions, as one would expect. To compare the distributions, we computed all three pairwise symmetric Kullback-Leibler (KL) divergences for the normalized top-20 mutation probabilities. The symmetric KL divergence between two discrete probability distributions $p$ and $q$ is defined as [equation missing in the source text]. For the mutation distributions pertaining to the pairs Europe-NA, Europe-Asia and Asia-NA, the KL divergences equal 0.672, 0.316 and 0.376 (ORF1a), 0.491, 0.435 and 0.646 (ORF1b), and 0.293, 1.021 and 0.303 (N), respectively, for data collected until 04-14-2020.
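A symmetric KL computation of this kind can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: the symmetric divergence is taken as the sum of the two directed divergences (some authors use half of that sum), natural logarithms are used, and the two top-20 lists are compared rank by rank after renormalization. All function names are hypothetical.

```python
import math


def top_k_normalized(probabilities, k=20):
    """Keep the k largest probabilities (in decreasing order) and renormalize them to sum to one."""
    top = sorted(probabilities, reverse=True)[:k]
    total = sum(top)
    return [p / total for p in top]


def kl_divergence(p, q):
    """Directed KL divergence D(p || q) with natural logarithms."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0:
            raise ValueError("q must be positive wherever p is positive")
        total += pi * math.log(pi / qi)
    return total


def symmetric_kl(p, q):
    """Assumed definition: D(p || q) + D(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    # Made-up site probabilities for two populations (illustrative only).
    region_a = top_k_normalized([0.30, 0.10, 0.05, 0.02], k=3)
    region_b = top_k_normalized([0.15, 0.12, 0.11, 0.01], k=3)
    print(round(symmetric_kl(region_a, region_b), 4))
```

Because the Good-Turing variant described earlier assigns a nonzero probability to every symbol, the zero-probability guard in `kl_divergence` should not trigger on its outputs; it is included only to keep the sketch safe on arbitrary inputs.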
These results indicate that the largest differences in the distributions in the ORF1a region exist between Europe and North America, while the largest differences in the ORF1b region exist between Asia and North America. For the N region, a significant difference between the distributions of mutations is observed between Europe and Asia, and at this point these large distances do not seem to have a simple explanation. Similarly, the corresponding KL divergences based on the samples collected until 04-10-2020 equal 0.788 (which is significantly larger than the value obtained from data collected until 04-14-2020), 0.328 and 0.371 (ORF1a), 0.743 (which is likewise significantly larger than the value obtained from data collected until 04-14-2020), 0.615 and 0.755 (ORF1b), and 0.315, 0.893 and 0.248 (N), respectively. Comparing the two collection dates, the KL divergences suggest relatively small changes in the distribution of mutations in the N region and larger changes in the ORF1a and ORF1b regions, which is expected. The distributions of mutations only reveal the statistical landscape of the mutation sites but not their exact locations in the genome. The actual mutated sites in the SARS-Cov-2 genomes are depicted in Figures 9, 10 and 11. As can be seen, the locations of the mutations for the first two pairs of categories are almost identical. Nevertheless, the positional stratification of mutations is significant for patients from different continents, especially in the N region of the SARS-Cov-2 genome. The largest spread of probability mass is observed for patients in Asia, which may be indicative of a larger exploration rate for mutations in the region where the outbreak originated.
Another plausible explanation is that Asia, the origin of the pandemic, is in a later phase of the pandemic when compared to Europe and North America. The observation may also have an impact on the design of testing schemes which use the N region of the genome, as for patients from Europe there exist only 2-3 sites with high mutation rates, with a similar trend observed for North American populations.

Figure 11: Positions in the SARS-Cov-2 genome with high probability of mutations in patients across three different continents collected until 04-14-2020. The height of the bar is proportional to the probability of the mutation.

Figure 12: Positions in the SARS-Cov-2 genome with high probability of mutations in European females below the age of 55 and males above the age of 55 collected until 04-14-2020. The height of the bar is proportional to the probability of the mutation.

References

[1] Viral mutation rates
[2] Mutation rates among RNA viruses
[3] Mechanisms of viral mutation
[4] Rates of evolutionary change in viruses: patterns and determinants
[5] The distribution of fitness effects caused by single-nucleotide substitutions in an RNA virus
[6] Mutational and fitness landscapes of an RNA virus revealed through population sequencing
[7] Evolvability of an RNA virus is determined by its mutational neighbourhood
[8] Influenza vaccines: the good, the bad, and the eggs
[9] Theory of lethal mutagenesis for viruses
[10] Mutation rate and genotype variation of Ebola virus from Mali case sequences
[11] Quantifying the diversification of hepatitis C virus (HCV) during primary infection: estimates of the in vivo mutation rate
[12] Good-Turing frequency estimation without tears
[13] Competitive distribution estimation: Why is Good-Turing good
[14] Genotype and phenotype of COVID-19: Their roles in pathogenesis
[15] 2019-nCoV) Real-Time RT-PCR Diagnostic Panel, Catalog Number 2019-nCoVEUA-01 with 1000 reactions - For Emergency Use Only
[16] GISAID: Global initiative on sharing all influenza data - from vision to reality
[17] MUSCLE: a multiple sequence alignment method with reduced time and space complexity
[18] Always Good Turing: Asymptotically optimal probability estimation
[19] Complexities of viral mutation rates
[20] Regularized weighted Chebyshev approximations for support estimation
[21] Chebyshev polynomials, moment matching, and optimal estimation of the unseen
[22] Small-sample distribution estimation over sticky channels
[23] Alternating Markov chains for distribution estimation in the presence of errors
[24] Achieving a golden mean: mechanisms by which coronaviruses ensure synthesis of the correct stoichiometric ratios of viral proteins
[25] Does COVID-19 hit women and men differently? U.S. isn't keeping track