key: cord-0751949-v4ubd1l6
authors: Korber, B.; Fischer, W. M.; Gnanakaran, S.; Yoon, H.; Theiler, J.; Abfalterer, W.; Hengartner, N.; Giorgi, E. E.; Bhattacharya, T.; Foley, B.; Hastie, K. M.; Parker, M. D.; Partridge, D. G.; Evans, C. M.; Freeman, T. M.; de Silva, T. I.; McDanal, C.; Perez, L. G.; Tang, H.; Moon-Walker, A.; Whelan, S. P.; LaBranche, C. C.; Saphire, E. O.; Montefiori, D. C.; Angyal, Adrienne; Brown, Rebecca L.; Carrilero, Laura; Green, Luke R.; Groves, Danielle C.; Johnson, Katie J.; Keeley, Alexander J.; Lindsey, Benjamin B.; Parsons, Paul J.; Raza, Mohammad; Rowland-Jones, Sarah; Smith, Nikki; Tucker, Rachel M.; Wang, Dennis; Wyles, Matthew D.
title: Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus
date: 2020-07-03
journal: Cell
DOI: 10.1016/j.cell.2020.06.043
sha: 9b5d22b423aa04a240766b814a16387edd321183
doc_id: 751949
cord_uid: v4ubd1l6

Summary A SARS-CoV-2 variant carrying the Spike protein amino acid change D614G has become the most prevalent form in the global pandemic. Dynamic tracking of variant frequencies revealed a recurrent pattern of G614 increase at multiple geographic levels: national, regional and municipal. The shift occurred even in local epidemics where the original D614 form was well established prior to the introduction of the G614 variant. The consistency of this pattern was highly statistically significant, suggesting that the G614 variant may have a fitness advantage. We found that the G614 variant grows to higher titer as pseudotyped virions. In infected individuals G614 is associated with lower RT-PCR cycle thresholds, suggestive of higher upper respiratory tract viral loads, although not with increased disease severity. These findings illuminate changes important for a mechanistic understanding of the virus, and support continuing surveillance of Spike mutations to aid in the development of immunological interventions.

to the fitness or antigenic profile of the virus is important to ensure the effectiveness of the vaccines and immunotherapeutic interventions as they advance to the clinic.

In response to the urgent need to develop effective vaccines and antibody-based therapeutics against SARS-CoV-2, over 90 vaccine and 50 antibody approaches are currently being explored (Cohen, 2020; Yu et al., 2020) . Most target the trimeric Spike protein, which mediates host cell binding and entry and is the major target of neutralizing antibodies Yuan et al., 2020) . Spike monomers are comprised of an N-terminal S 1 subunit that mediates receptor binding and a membrane-proximal S 2 subunit that mediates membrane fusion (Hoffmann et al., 2020a; Walls et al., 2020; Wrapp et al., 2020) . SARS-CoV-2 and SARS-CoV-1 share ~79% sequence identity (Lu et al., 2020) and both use angiotensin converting enzyme-2 (ACE2) as their cellular receptor.

Antibody responses to the SARS-CoV-1 Spike are complex. In some patients with rapid and high neutralizing antibody responses, an early decline of these responses was associated with increased severity of disease and a higher risk of death (Ho et al., 2005; Liu et al., 2006; Temperton et al., 2005; Zhang et al., 2006) . Some antibodies against SARS-CoV-1 Spike mediate antibody-dependent enhancement (ADE) of infection in vitro and exacerbate disease in animal models (Jaume et al., 2011; Wan et al., 2020; Wang et al., 2014; Yip et al., 2016) .

Most current SARS-CoV-2 immunogens and testing reagents are based on the Spike protein sequence of the Wuhan reference sequence , and first-generation antibody therapeutics were discovered based on the early pandemic infections and evaluated using the Wuhan reference sequence proteins. Alterations from the reference sequence as the virus propagates in human-to-human transmission could potentially alter the viral phenotype and/or the efficacy of immune-based interventions. Therefore, we have designed bioinformatic tools to create an "early warning" strategy to evaluate Spike evolution during the pandemic, to enable the testing of mutations for phenotypic implications and the generation of appropriate antibody breadth evaluation panels as vaccines and antibody-based therapeutics progress. Phylogenetic analysis of the global sampling of SARS-CoV-2 is being very capably addressed at the Global Initiative for Sharing All Influenza Data (GISAID) database (www.gisaid.org; Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) and Nextstrain (nextstrain.org; (Hadfield et al., 2018) . In a setting of low genetic diversity like that of SARS-CoV-2, however, with very few de novo mutational events, phylogenetic methods that use homoplasy to identify positive selection (Crispell et al., 2019) have limited statistical power. Additionally, recombination can add a confounding factor to phylogenetic reconstructions, and recombination is known to play a role in natural coronavirus evolution (Graham and Baric, 2010; Lau et al., 2011; Li et al., 2020; Oong et al., 2017; Rehman et al., 2020) , and recombinant sequences (although potential sequencing artefacts) have been found among SARS-CoV-2 sequences (De Maio et al., 2020) . Given these issues, we developed an alternative indicator of potential positive selection, by identifying variants that are recurrently becoming more prevalent in different geographic locations.

If increases in relative frequency of a particular variant are repeatedly observed in distinct geographic regions, that variant becomes a candidate for conferring a selective advantage.

Single amino acid changes are worth monitoring as they can be phenotypically relevant. Among coronaviruses, point mutations have been demonstrated to confer resistance to neutralizing antibodies in MERS-CoV (Tang et al., 2014) and SARS-CoV-1 (Sui et al., 2008; ter Meulen et al., 2006) . In the HIV Envelope, single amino acid changes are known to alter host species susceptibility , increase expression levels (Asmal et al., 2011) , change viral phenotype from Tier 2 to Tier 1, causing an overall change in neutralization sensitivity (Gao et al., 2014; LaBranche et al., 2019) , and confer complete or near complete resistance to classes of neutralizing antibodies (Bricault et al., 2019; Sadjadpour et al., 2013; Zhou et al., 2019) .

We have developed a bioinformatic pipeline to identify Spike amino acid variants that are increasing in frequency across many geographic regions by monitoring GISAID data. By early April 2020, it was clear the Spike D614G mutation was exhibiting this behavior, and G614 has since become the dominant form in the pandemic. We present experimental evidence that the G614 variant is associated with greater infectivity, as well as clinical evidence that it is associated with higher viral loads. We continue to monitor other mutations in Spike for frequency shifts at regional and global levels and to provide regular updates at a public web site (cov.lanl.gov).

Our analysis pipeline to track SARS-CoV-2 mutations in the COVID-19 pandemic is based on regular updates from the GISAID SARS-CoV-2 sequence database (GISAID acknowledgments are in Table   S1 ). GISAID sequences are generally linked to the location and date of sampling. Our website provides visualizations and summary data that allow regional tracking of SARS-CoV-2 mutations over time. Hundreds of new SARS-CoV-2 sequences are added to GISAID each day, so we have automated steps to create daily working alignments (Kurtz et al., 2004) (Fig. S1 ). The analysis presented here is based on a May 29, 2020 download of the GISAID data, when our Spike alignment included 28,576 sequences; updated versions of key figures can recreated at our website (cov.lanl.gov) . The overall evolutionary rate for SARS-CoV-2 is very low, so we set a low threshold for a Spike mutation to be deemed "of interest", and we track all sites in Spike at which 0.3% of the sequences differ from the Wuhan reference sequence, monitoring them for increasing frequency over time in geographic regions, as well as for recurrence in different geographic regions. Here we present results for the first amino acid variant to stand out by these metrics, D614G.

Increasing frequency and global distribution. The Spike D614G amino acid change is caused by an A-to-G nucleotide mutation at position 23,403 in the Wuhan reference strain; it was the only site identified in our first Spike variation analysis in early March that met our threshold criterion. At that time, the G614 form was rare globally, but gaining prominence in Europe, and GISAID was also tracking the clade carrying the D614G substitution, designating it the "G clade". The D614G change is almost always accompanied by three other mutations: a C-to-T mutation in the 5' UTR (position 241 relative to the Wuhan reference sequence), a silent C-to-T mutation at position 3,037; and a C-to-T mutation at position 14,408 that results in an amino acid change in RNA-dependent RNA polymerase (RdRp P323L). The haplotype comprising these 4 genetically linked mutations is now the globally dominant form. Prior to March 1, it was found in 10% of 997 global sequences; between March 1-March 31, it represented 67% of 14,951 sequences; and between April 1-May 18 (the last data point available in our May 29 th sample) it represented 78% of 12,194 sequences. The transition from D614 to G614 was occurred asynchronously in different regions throughout the world, beginning in Europe, followed by North America and Oceania, then Asia (Figs. 1-3, S2-S3).

We developed two statistical approaches to assess the consistency and significance of the D614-to-G614 transition. In general, to observe a significant change in the frequency of variants in a geographic region, three requirements must be met. First, both variants must at some point be cocirculating in the geographic area. Second, there must be sampling over an adequate duration to observe a change in frequency. Third, enough samples must be available for adequate statistical power to detect a difference. Both of our approaches enable us to systematically extract all GISIAD local and regional data that meet these three requirements.

Our first approach requires that there be an "onset", defined as the first day where the cumulative number of sequences reached 15 and both forms were represented at least 3 times; we further require that there be at least 15 sequences available at least two weeks after the onset. Each geographic region that meets these criteria is extracted separately based on the hierarchical geographic/political levels designated in GISAID (Fig. 1B) . A two-sided Fisher's exact test compares the counts in the pre-onset period to the counts after the two-week delay period and provides a pvalue against the null hypothesis that the fraction of D614 vs G614 sequences did not change. All regions that met the above criteria and that showed significant change in either direction (p<0.05) are included. Almost all shifted towards increasing G614 frequencies: 5/5 continents, 16/17 countries, (two-sided binomial p-value of 0.00027); 16/16 regions (p = 0.00003), and 11/12 counties and cities (p= 0.0063).

In new form due to stochastic effects or serial re-introductions, or apparent emergence due to sampling biases, the consistency of the shift to G614 across regions is striking. The increase in G614 often continued after national stay-home-orders were implemented, and in some cases beyond the 2-week maximum incubation period.

We found two exceptions to the pattern of increasing G614 frequency in Figure 1B ; details regarding these cases are shown in Figure S4 . The first is Iceland. Changes in sampling strategy during a regional molecular epidemiology survey conducted through the month of March might explain this exception (Gudbjartsson et al., 2020) . In early March, only high-risk people were sampled, the majority being travelers from countries in Europe where G614 dominated. In mid-March, screening began to include the local population; this coincided with the appearance of the D614 variant in the sequence data set. The second exception is Santa Clara county, one of the most heavily sampled regions in California (Deng et al., 2020) . The D614 variant dominates sequences from the Santa Clara Department of Public Health (DPH) to date; the G614 variant was apparently not established in that community. In contrast, a smaller set of Santa Clara county sequences, sampled mid-March to early April, were specifically noted to be from Stanford: the Stanford samples had a mixture of both 7 forms co-circulating (Fig. S4) , suggesting that the two communities within Santa Clara County are effectively distinct. A June 19 th GISAID update for several California counties is provided in Figure   S4C , and the G614 form is present in the most recent Santa Clara DPH samples.

Our second statistical approach to evaluating the significance of the D614-to-G614 transition (Fig. 3) uses the time-series data in GISAID more fully. Here we extracted all regional data from GISAID that had a minimum of 5 sequences representing each of the D614 and the G614 variants, and at least 14 days of sampling. We then modeled the daily fraction of G614 as a function of time using isotonic regression, testing the null hypothesis that this fraction does not change over time (i.e. it remains roughly flat over time, with equally likely random fluctuations of increase or decrease). We then separately tested the null against two alternative hypotheses: that the fraction of G614 either increases, or that it decreases. Figure 3A shows separate p-values for all subcountries/states and counties/cities that met the minimal criteria. 30 of 31 subcountries/states with a significant change in frequency were increasing in G614; a binomial test indicates that G614 increases are highly significantly enriched (p-value = 2.98e-09). This was also found in 17 of 19 counties/cites (p-value = 0.0007). Figure 3B shows examples for 3 cities, plotting the daily fraction of G614 as a function of time. Country summaries (similar to Fig. 3A) , and plots for all regions (similar to Fig. 3B ) are included in Data S1.

The earliest examples of sequences carrying parts of the 4-mutation haplotype that characterizes the D614G GISAID G clade were found in China and Germany in late January, and they carried 3 of the 4 mutations that define the clade, lacking only the RdRp P323L substitution (Fig. S5D ). This may be an ancestral form of the G clade. One early Wuhan sequence and one early Thai sequence had the D614G change, but not the other 3 mutations (Fig.   S5D ); these may have arisen independently. The earliest sequence we detected that carried all 4 mutations was sampled in Italy on Feb. 20 (Fig. S5D ). Within days, this haplotype was sampled in many countries in Europe.

Structural implications of the Spike D614G change. D614 is located on the surface of the Spike protein protomer, where it can form contacts with the neighboring protomer (Fig. 4A ). Cryo-EM structures (Walls et al., 2020; Wrapp et al., 2020) indicate that the sidechain of D614 and T859 of the neighboring protomer (Fig. 4B ) form a between-protomer hydrogen bond, bringing together a residue from the S 1 unit of one protomer and a residue of the S 2 unit of the other protomer (Fig. 4C) . The change to G614 would eliminate this sidechain hydrogen bond, possibly increasing mainchain 8 flexibility and altering between-protomer interactions. In addition, this substitution could modulate glycosylation at the nearby N616 site, influence the dynamics of the spatially proximal fusion peptide ( Fig. 4D ) of the neighboring protomer, or have other effects.

G614 is associated with potentially higher viral loads in COVID-19 patients but not with disease severity. SARS-CoV-2 sequences from 999 individuals presenting with COVID-19 disease at the Sheffield Teaching Hospitals NHS Foundation Trust were available, and linked to clinical data.

The Sheffield data include age, sex, date of sampling, hospitalization status (defined as outpatient, OP; inpatient, IP, requiring hospitalization; or admittance into the intensive care unit, ICU), and the cycle threshold (Ct) for a positive signal in E-gene based RT-PCR. The Ct is used here as a surrogate for relative viral loads; lower Ct values indicate higher viral loads (Corman et al., 2020) , though not all viral nucleic acids represent infectious viral particles. RT-PCR methods changed during the course of the study due to limited availability of testing kits. The first method involved nucleic acid extraction; the second method, heat treatment (Fomsgaard and Rosenstierne, 2020) . A generalized linear model (GLM) used to predict PCR Ct based on the RT-PCR method, sex, age, and D614G status showed only the RT-PCR method (p <2e-16) and D614G status (p=0.037) to be statistically significant (Fig. 5A ). Lower Ct values were observed in G614 infections. While our paper was in revision, G614-variant association with low Ct values in vivo (Fig. 5) was independently reported by two other groups (Lorenzo-Redondo et al., 2020; Wagner et al., 2020) , in preprints that have not yet been peer reviewed.

We found no significant association between D614G status and disease severity as measured by hospitalization outcomes. A comparison of D614G status and hospitalization (combining IP+ICU) was not significant (p = 0.66, Fisher's exact test), although comparing ICU admission with (IP+OP) did have borderline significance (p= 0.047) (Fig. 5B ). Regression analysis reinforced the result that G614 status was not associated with greater levels of hospitalization, but that higher age (Dowd et al., 2020; Promislow, 2020) , male sex (Conti and Younes, 2020; Promislow, 2020) and higher Ct values (lower viral loads) were each highly predictive of hospitalization. Further analysis showed that viral load was not masking a potential D614G status effect on hospitalization (see STAR Methods).

Univariate analysis also found highly significant associations between age and male sex and hospitalization (see STAR Methods).

G614 is associated with higher infectious titers of spike-pseudotyped virus. We quantified the infectious titers of pseudotyped single-cycle vesicular stomatitis virus (VSV) and lentiviral particles displaying either D614 or G614 SARS-CoV2 Spike protein. For both the VSV and lentiviral pseudotypes, G614-bearing viruses had significantly higher infectious titers (2.6 -9.3 fold increase) than their D614 counterparts; this was confirmed in multiple cell types ( Figure 6A -C). Similar results recently reported in a preprint that has not yet been peer reviewed also suggest that G614 increases both spike stability and membrane incorporation .

TMPRSS2, a type-II transmembrane serine protease, cleaves the viral spike after receptor binding to enhance entry of MERS-CoV, SARS-CoV and SARS-CoV-2 (Hoffmann et al., 2020b; Kleine-Weber et al., 2018; Matsuyama et al., 2020; Millet and Whittaker, 2014; Park et al., 2016; Shulla et al., 2011; Zang et al., 2020) . Spike 614 is in a pocket adjacent to the fusion peptide near the expected TMPRSS2 cleavage site, suggesting there could be differences in the propensity and/or requirement for TMPRSS2 of the G614 variant. To test this hypothesis, we infected 293T cells stably expressing ACE2 receptor in the presence or absence of TMPRSS2, and quantified the titer of infectious virus.

We found similar fold-changes in the titers between D614 and G614 regardless of TMPRSS2 expression (Fig. 6A) . Hence, entry of G614-bearing viruses in 293T-ACE2 cells, as compared to D614-bearing viruses, is not enhanced by TMPRSS2. Further studies are required to determine if the G614 variant shows increased titers in lung cells, which may recapitulate native protease expression levels more faithfully, and to determine if this variant increases the fitness of authentic SARS-CoV-2.

We also tested if the D614G variations would be similarly neutralized by polyclonal antibody. The convalescent sera of six San Diego residents, likely infected in early to mid-March when both D614 and G614 were circulating, each demonstrate equivalent or better neutralization of G614-bearing pseudovirus compared to D614-bearing pseudovirus (Fig 6D-E) . Although we do not know with which virus each of these individuals were infected, these initial data suggest that despite increased fitness in cell culture, G614-bearing virions are not intrinsically more resistant to neutralization by convalescent sera.

Spike has very few mutations overall. A small set has reached >0.3% of the global population sample, the threshold for automatic tracking at the cov.lanl.gov website ( Fig. 7A and B, details provided in Table S2 ). Regions in the alignment where entropy is relatively high compared to the rest of Spike (i.e., local clusters of rare mutations) are also tracked (Table S2 ). Genetic mutations of interest are mapped as amino acid changes onto a Spike structure (Figure 4 ). The mutation resulting in the signal peptide L5F change recurs many times in the tree and is stably maintained in about 0.6% of the global GISAID data. There are several clusters of mutations in region of the spike gene encoding the N-terminal domain (NTD) and RBD which are potential targets for neutralizing antibodies (Chen et al., 2017; Zhou et al., 2019a; (Sui et al., 2008; Tang et al., 2014; ter Meulen et al., 2006) . The RBD cluster (positives 475-483) spans two positions, at 475 and 476, that are located within 4Å of bound ACE2 (Fig. 4D ) (Yan et al., 2020) . The fusion peptide contains a cluster of amino acid changes between 826-839; this cluster is highlighted in Fig. 7 to illustrate our web-based tools for tracking variation (Fig. 7A-C) . The fusion core of HR1 , next to the helix break in pre-fusion Spike, also contains a cluster of amino acid changes between 93E-940 (Fig. 4E ). The motif SXSS (937-940) may enhance the association of helices (Dawson et al., 2002; Salamango and Johnson, 2015) . The cytoplasmic tail of Spike also contains a site of interest, P1263L.

Our data show that over the course of one month, the variant carrying the D614G Spike mutation became the globally dominant form of SARS-CoV-2. Phylogenetic tracking of SARS-CoV-2 variants at Nextstrain reveals complex webs of evolutionary and geographical relationships (nextstrain.org; (Hadfield et al., 2018) ); travelers dispersed G614 variants globally, and likely would have introduced and reintroduced G614 variants into different locations. Still D614 prevalent epidemics were very well established in many locations when G614 first began to appear (see Fig. S2 for examples). The mutation that causes the D614G amino change is transmitted as part of a conserved haplotype defined by 4 mutations that almost always track together ( Fig. S5 and S6 ). The pattern of increasing G614 frequency within many different populations where both D614 and G614 were co-circulating is highly significant, suggesting that G614 may be under positive selection (Fig. 1b, 3) . We also found G614 to be associated with higher levels of viral nucleic acid in the upper respiratory tract in human patients ( Fig. 5 ) (suggestive of higher viral loads), and with higher infectivity in multiple pseudotyping assays (Fig. 6 ).

Given that most G614 variants belong to the G clade lineage, phylogenetic methods that depend upon recurrence of mutational events for their signal are poorly powered to resolve whether D614G is under positive selection. The GISAID data, however, provided the opportunity to look into the relationships among the SARS-CoV-2 variants in the context of time and geography, enabling us to track the increase in frequency of G614 as an early indicator of possible positive selection. This approach is potentially subject to founder effects and sampling biases, and so we generally view this strategy as simply an early indicator of an amino acid change that should be monitored further and tested. The G614 variant stood out, however, in our early detection framework for several reasons.

First was the consistency of increase across geographic regions, which was highly significantly non-random (Figs. 1b and Fig. 3 ). Second, if the two forms were equally likely to propagate, one would expect the D614 form to persist in many locations where the G614 form was introduced into the ongoing well-established D614 epidemics. Instead, we found that even in such cases, G614

increased S2 and S3.) . Third, the increase in G614 frequency often continued well after national stay-at-home orders were in place, when serial reseeding from travelers was likely to be significantly reduced (Figs. 2, S2 and S3.) .

Our global tracking data show that the G614 variant in Spike has spread faster than D614. We interpret this to mean that the virus is likely to be more infectious, a hypothesis consistent with the higher infectivity observed with G614 Spike-pseudotyped viruses we observed in vitro (Fig. 6) , and the G614 variant association with higher patient Ct values, indicative of potentially higher in vivo viral loads (Fig. 5) . Interestingly, we did not find evidence of G614 impact on disease severity; i.e., it was not significantly associated with hospitalization status. However, an association between the G614 variant and higher fatality rates has been reported in a comparison of mortality rates across countries, although this kind of analysis can be complicated by different availability of testing and care in different nations (Becerra-Flores and Cardozo, 2020).

While higher infectiousness of the G614 variant may fully account for its rapid spread and persistence, other factors should also be considered. These include epidemiological factors, as viral spread also depends on who it infects, and epidemiological influences can also cause changes in genotype frequency to mimic evolutionary pressures. In all likelihood, a combination of evolutionary selection for G614 and the founder's effects of being introduced into highly mobile and connected populations may have together contributed in part to its rise. The G-clade mutations in the 5' UTR, or in the RdRP protein might also have impact. In addition, there could be immunological consequences resulting from the G614 change in Spike. The G614 variant is sensitive to neutralization by polyclonal convalescent sera (Fig. 5) , which is encouraging in terms of immune interventions, but it will be important to determine whether the D614 and G614 forms of SARS-CoV-2 are differentially sensitive to neutralization by vaccine-elicited antibodies or by antibodies produced in response to infection with either form of the virus. Also, if the G614 variant is indeed more infectious than the D614 form ( Fig.   6 ), it may require higher antibody levels for protection by vaccines or antibody therapeutics than the D614 form. Antibodies against an immunodominant linear epitope spanning Spike 614 in SARS-CoV-1 were associated with ADE activity (Wang et al., 2016) , and so it is possible that this mutation may impact ADE.

Tracking mutations in the Spike gene has been our primary focus to date because of its relevance to vaccine and antibody-based therapy strategies currently under development. Such interventions take months to years to develop. For the sake of efficiency, contemporary variation should be factored in during development to ensure that the interventions will be effective against circulating variants when they are eventually deployed. To this end, we built a data-analysis pipeline to enable the exploration of potentially interesting mutations on SARS-CoV-2 sequences. The analysis is updated daily as the data become available through GISAID, enabling experimentalists can make use of the most current data available to inform vaccine development, reagents for evaluating antibody response, and experimental design. The speed with which G614 variant became the dominant form globally suggests the need for continued vigilance.

Shifts in frequency towards the G614 variant in any given geographic region could in principle result from either founder effects or sampling biases; it was the consistency of this pattern across regions where both forms of the virus were initially co-circulating that led us to suggest that the G614 form might be transmitted more readily due to an intrinsic fitness advantage; however, systematic biases across many regions could impact the levels of significance we observed. The lack of association between G614 and hospitalization that we report may miss impacts on disease severity that are more subtle than we can detect. The experimental approach taken here to acquire laboratory evidence of increased fitness of the D614G mutation is based on two different pseudovirus models of infection in established cell lines. The extent to which this model faithfully recapitulates wild-type virus infection in natural target cells of the respiratory system is still being determined, and our laboratory experiments do not directly address the biology and mechanics of natural transmission. Infectiousness and transmissibility are not always synonymous, and more studies are needed to determine if the D614G mutation actually led to an increase number of infections, not just higher viral loads during infection.

We encourage others to study this phenomenon in greater detail with wild-type virus in natural infection and varied target cells (Hou et al., 2020) , and in relevant animal models. Finally, the neutralization assays performed were based on sera from SARS-CoV-2 infected individuals with an unknown D614G status. Thus, while they show that the G614 variants are neutralization sensitive, more work is needed to resolve whether the potency of neutralization is affected when the variant that initiated the immune response differs from the test variant, or when monoclonal antibodies are used.

Nguyen for sharing preliminary MD data. We thank Alessandro Sette and Shane Crotty for survivor sera, and Sharon Schendel for manuscript edits. We acknowledge Barney research. We gratefully acknowledge the team at GISAID for creating SARS-CoV-2 global database, and the many people who provided sequence data (Table S1 ). 

The authors declare no competing interests. and G614 (blue) variants in different continents between January 12 to May 12. The measure of interest is the relative frequency over time. The shape of the overall curve just reflects sample availability: Sequencing was more limited earlier in the epidemic (hence the left-hand tail), and there is a time lag between viral sampling and sequence availability in GISAID (hence the right-hand tail).

Weekly running count plots were generated with Python Matplotlib (Hunter, 2007) ; all elements of this figure are frequently updated at cov.lanl.gov. essentially G614 epidemics when sampling began, but even in these cases small traces of D614 found early on were soon lost (e.g. France and Italy). The Italian epidemic started with D614 clade, but Italy had the first sampled case of the full G614 haplotype, and had shifted to all G614 samples prior to March 1 (see Fig. S5 ). European nations that began with a mixture of D614 and G614 most clearly reveal the frequency shifts (e.g. Germany, Spain and the United Kingdom). The UK is richly sampled, and so is subdivided into smaller regions: England, Wales, and Scotland, then further divided to display two well-sampled English cities. Even in settings with very well-established D614 epidemics (e.g. Wales and Nottingham, also see Figs. S2 and S3), G614 becomes prevalent soon after its appearance. The increase in G614 frequency often continues well after stay-at-home orders are in place (pink line) and past the subsequent two-week incubation period (pink transparent box).

The figures shown here can be recreated with contemporary data from GISAID at the cov.lanl.gov website. UK stay-at-home order dates were based on the date of the national proclamation, others were documented on the web (Schramm and Melin, 2020). decreasing time window in California was driven by sampling from Santa Clara county, a rare region that has retained the D614 form (Fig. S4 ). In the May 29 th data set used here, Santa Clara county was sampled later in May than any other region in the California, so the California G614 frequency dips at this last available time point. If Santa Clara county is removed from the California sample, the pattern of increasing levels of G614 is restored (red asterisk). B. Three examples for cities, plotting the daily fraction of G614 as a function of time, and accompanied by plots of running weekly counts.

The dot size is proportional to the number of sequences sampled that day. The staircase line is the maximum likelihood estimate under the constraint that the logarithm of the odds ratio is nondecreasing. Two typical examples are shown highlighted in blue (Sydney and Cambridge), and one exception is shown, highlighted in orange (Yakima). Yakima had a brief sampling window enriched for G614 early in the sampling period, but otherwise G614 maintained a low frequency. Summaries and plots for all regional data at levels 2-4 (included country) are included in Data S1. loads. The PCR method was changed part way through April due to shortages of nucleic-acid extraction kits (Fomsgaard and Rosenstierne, 2020) . Ct levels for the two PCR methods (nucleic acid extraction vs. simple heat inactivation) differ, and so we used a GLM to evaluate statistical impact of D614G across methods. B) D614G status was not statistically associated with hospitalization status (outpatient (OP), inpatient (IP), or ICU) as a marker of disease severity, but age was highly correlated. The number of counts in each category is noted in the upper right-hand corner of each graph. See the main text and methods for statistical details.

Spike grows to higher titer than D614 Spike in Vero, 293T-ACE2 and 293T-ACE2-TMPRSS2 cells as measured in terms of focus forming units (ffu). Four asterisks (****) indicates a p <0.0001 by a Student's t test in pairwise comparisons. Experiments were repeated twice, each time in triplicate.

Using a GLM to assess viral infectivity of the D614 and G614 variants across cell types and to account for repeat experiments, we found the G614 variant had average 3-fold higher infectious titer than D614, and that this difference was highly significant (p=9x10 -11 ) (see STAR Methods). B,C)

Recombinant lentiviruses pseudotyped with the G614 Spike were more infectious than corresponding D614 S-peudotyped viruses in (B) 293T/ACE2 (6.5-fold increase) and (C) TZM-bl/ACE2 cells (2.8fold increase, p <0.0001). Relative luminescence units (RLUs) of Luc reporter gene expression (Naldini et al., 1996) were standardized to p24 content of the pseudoviruses (p24 content of pseudoviruses for 293T/ACE2 cells: D614= 269 ng/ml, G614= 255 ng/ml; p24 content of pseudoviruses for TZM-bl/ACE2 cells, D614= 680 ng/ml, G614= 605 ng/ml). Background RLU was measured in wells that received cells but no pseudovirus. D,E) Convalescent sera from six individuals in San Diego (4765-4767 and 4774-4777) can neutralize both D614-(orange) and G614-(blue) bearing VSV pseudoviruses. Samples 1592 and 1616 (grey) are negative control normal human sera. % relative infection is plotted vs. log polyclonal antibody concentration. Other sites of interest cluster in a main lineage, but are occasionally found in other parts of the tree in distant geographic regions, thus likely to be recurring at a low level. Build parsimony trees. A brief parsimony search (parsimony ratchet, with 5 replicates) is performed with 'oblong' (Goloboff, 2014) This is intended as an efficient clustering procedure rather than an explicit attempt to achieve an accurate phylogenetic reconstruction, but it appears to yield reasonable results in this situation of a very large number of sequences with a very small number of changes, where more complex models may be subject to overfitting. When multiple most-parsimonious trees are found, only the shortest of these (under a p-distance criterion) is retained. Distance scoring is performed with PAUP* (Swofford, 2003) . B. Table indicating Contemporary versions of these figures can be created at cov.lanl.gov. Care should be taken to try to avoid systematic sequencing errors and processing artefacts among rare variants (for example, see Figure  1 . The intent of these procedures is to generate, for each of several regions, a set of contiguous codon-aligned sequences, complete in that region, without extensive uncalled bases, large gaps, or regions that are unalignable or highly divergent, in reasonable running time for n > 30,000 ~30kb viral genomes. This allows daily processing of GISAID data to enable us to track mutations. This process provides the foundational data to enable the generation of Figs. 1-3 and 7. A. Processing procedures: 1. Download all SARS-CoV-2 sequences from GISAID.org (34,607 as of 2020-05-29). The downloaded sequences are stored in compressed form (via bzip2: https://sourceware.org/git/bzip2.git).

2. Align sequences to the SARS-CoV-2 reference sequence (NC_045512), trim to desired endpoints, and filter for coverage and quality. These steps are incorporated in a single Perl script, 'align_to_ref.pl', briefly summarized here: sequences are compressed for identity, then mapped against the given reference sequence using 'nucmer' from the 'MUMmer' package (Kurtz et al., 2004) . The nucmer 'delta' file contains locations of matching regions and is parsed and used to, first, partition the sequences into "good" and "bad" subsets, and then to generate alignments from the "good" sequences.

A series of criteria is used successively to exclude sequences with large internal gaps, excessive five-and threeprime gaps, large numbers of mismatches or ambiguities (>30) overall, or regions with a high concentration of mismatches or ambiguities (>10 in any 100 nt subsequence): the counts of these categories of "bad" sequence are shown, for different regional genome alignments.

We then create the following different regional subalignments: CODING-REGIONS 2 ("FULL", from the 5'-most start-codon (orf1ab) to the 3'-most stop-codon (ORF10), NC_045512 bases 266-29,674; SPIKE, the complete surface glycoprotein coding region, bases 21,563-25,384; NEAR-COMPLETE ("NEARCOMP", the most-commonly-sequenced region of the genome, bases 55-29,836; COMPLETE, matching the NC_045512 sequence from start up to the poly-A tail, bases 1-29,870; 5' UTR, the five-prime untranslated region, bases 1-265 only. Generally speaking, the smaller the region, the more sequences are included.

Sequences are trimmed to the extent of the reference (with minimum allowed gaps at 5' and 3' ends), following which the pairwise alignments are generated from the matching regions, and a multiple sequence alignment is constructed from the pairwise alignments.

3. De-duplicate 1 . To reduce computational demands, sequences are compressed by identity following trimming to the desired region, by computing a hash value for each sequence (currently the SHA-1 message digest, 160 bits encoded as a 40-character hex string). To prevent the loss of parsimonyinformative characters when they occur in identical strings, however, multiple sequences are reduced to a minimum of two occurrences. 4. Codon-align. Gaps are introduced into the entire compressed alignment so that the alignment column containing the last base of each codon has a number divisible by three; this simplifies processing of translations. Code for this procedure is derived from the GeneCutter tool from the LANL HIV database (https://www.hiv.lanl.gov/content/sequence/GENE_CUTTER/cutter.html).

5. Partition (full/spike-only). For subalignments that encompass the spike protein and substantial additional sequence, the spike region is extracted separately, to allow matched comparisons.

6. Build parsimony trees. A brief parsimony search (parsimony ratchet, with 5 replicates) is performed with 'oblong' (Goloboff, 2014) This is intended as an efficient clustering procedure rather than an explicit attempt to achieve an accurate phylogenetic reconstruction, but it appears to yield reasonable results in this situation of a very large number of sequences with a very small number of changes, where more complex models may be subject to overfitting. When multiple most-parsimonious trees are found, only the shortest of these (under a p-distance criterion) is retained. Distance scoring is performed with PAUP* (Swofford, 2003) .

7. Re-duplicate (expand, i.e., uncompress). The original sequence names and occurrence counts are restored to FastA format files and the appropriate leaf taxa added to the parsimony trees. 8. Sort alignment by tree. Sequences in the FastA files are sorted by the expanded tree, allowing patterns of mutation to be discerned by inspection.

9. Mutations of interest can be readily tracked on the trees to resolve whether they are identified in predominantly in single clades or distributed throughout the tree and likely to be recurring (e.g. Fig. 7 (sites of interest with low frequency amino acid substitutions) and Fig. S6 (Site 614) ). Fig. 2 and S3, and Fig. 1 has details about how to read these figures. When a particular stay-at-home order date was known for a state or county it is shown as a pink line, followed by a light pink block indicating the maximum two-week incubation time. Different counties in California had different stay-at-home order dates (Mar. 16-19) so are not highlighted, but more detail can be seen regarding California in Fig. S4 . The decline in D614 frequency often continues well after the stay-at-home orders were initiated, and sometimes beyond the 14-day maximum incubation period, when serial reintroduction of the G614 would be unlikely. On the right, Washington State is shown, with details from two heavily sampled counties, Snohomish and King. Both counties had wellestablished ongoing D614 epidemics when G614 variants were introduced, undoubtedly by travelers. Washington state's stay-at-home order was initiated March 24. At this time there were 1170 confirmed cases in King County, and 614 confirmed cases in Snohomish County. (Confirmed COVID19 case count data from: COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University). Testing was limited, and so this is lower bound on actual cases. Of the sequences sampled by March 24, 95% from King County (153/161) and 100% from Snohomish County (33/33) were the original D614 form (Part B, details at cov.lanl.gov). By mid-April, D614 was rarely sampled. Whatever the geographic origin of the G614 variants that entered these counties, and whether one or if multiple G614 variants were introduced, the rapid expansion of G614 variants occurred in the framework of well-established local D614 variant epidemics. Santa Clara county is one of the two exceptions to the pattern of D614 decline in Fig. 1B : details are provided in Fig. S4A and C. Fig. 2 and Fig. S2 , and Fig. 1 has details about how to read these figures. The plot representing national sampling in Australia is on the left, with two regional subsets of the data on the right. In each case a local epidemic started with the D614 variant, and despite being well established, the G614 variant soon dominates the sampling. Only limited recent sampling from Asia is currently available in GISAID; to include more samples on the map the 10-day period between March 11-20, is shown rather than the period between March 21-30; even the limited sampling mid-March the supports the repeated pattern of a shift to G614. The Asian epidemic was overwhelmingly D614 through February, and despite this, G614 repeatedly becomes prominent in sampling by mid-March. Fig. 1B . Many samples from the Santa Clara County Department of Public Heath (DPH) were obtained from March into May, and D614 has steadily dominated the local epidemic among those samples. The subset of Santa Clara county samples specifically labeled "Stanford", however, were sampled over a few weeks mid-March through early April, and have a mixture of both the G614 and D614 forms. These distinct patterns suggest relatively little mixing between the two local epidemics. Why Santa Clara county DPH samples should maintain the original form is unknown, but one possibility is that they may represent a relatively isolated community that had limited exposure to the G614 form, and G614 may not have had the opportunity to become established in this community -though this may be changing, see Part C. The local stay-at-home orders were initiated relatively early, March 16, 2020. B. Details regarding Iceland, the only country with an exceptional pattern from Fig. 1B . All Icelandic samples are from Reykjavik, and only G614 variants were initially observed there, with a modest but stable introduction of the original form D614 in mid-March. This atypical pattern might be explained by local sampling. The Icelanders conducted a detailed study of their early epidemic (Gudbjartsson et al., 2020) , and all early March samples were collected from high risk travelers from Europe and people in contact with people who were ill; the majority of the traveler samples from early March were from people coming in from Italy and Austria, and G614 dominated both regions. On March 13, they began to sequence samples from local population screening, and on March 15, more travelers from the UK and USA with mixed G614/D614 infections began to be sampled in the high-risk group, and those events were coincident with the appearance of D614. C. Updated data regarding California from the June 19, 2020 GISAID sampling. Most of the analysis in this paper was undertaken using the May 29, 2020 GISAID download, but as California was an interesting outlier, and more recent sampling conducted while the paper was under review was informative, we have included some additional plots from California data that were available at the time of our final response to review, on June 19 th . Informative examples from well-sampled local regions are shown. Stay-at-home order dates are shown as a pink line, followed by a light pink block indicating the maximum two-week incubation time. N indicates the number of available sequences. Overall California, and specifically, San Diego and San Joaquin, show a clear shift from D614 to G614. The transition for San Joaquin was well after the stay-at-home orders and incubation period had passed. San Francisco shows a trend towards G614. Santa Clara DPH, which was essentially all D614 in our May 29 th GISAID download, had 7 G614 forms sampled in late May that were evident in our June 19 th GISAID download. Ventura is an example of a setting that was essentially all G614 when it began to be sampled significantly in early April, so a transition cannot be tracked; i.e. we cannot differentiate in such cases whether the local epidemic originated as a G614 epidemic, or whether it went through a transition from D614 to G614 prior to sampling. The figures in Parts A, B, and C can be recreated with more current data at https://cov.lanl.gov. Figures 1 and 2 . Weekly running counts of the earliest forms of the virus carrying G614. B. The GISAID G clade is based on a 4 base haplotype that distinguishes it from the original Wuhan form. Part B shows the tallies of all of the variants among the 4 base mutations that are the foundation of the G clade haplotype, versus the bases found in the original Wuhan reference strain, and highlights of some of the earliest identified sequences bearing these mutations. We first consider just the 3 mutations that are in coding regions, to enable using an alignment that contains a larger set of sequences. One is in the RdRp protein (nucleotide C14408T resulting in a P323L amino acid change), one in Spike (nucleotide A23403G resulting in the D614G amino acid change) and one is silent (C3037T). The other mutation is in the 5' UTR (C241T), and tallies based on all 4 positions are done separately, as they are based on an alignment with fewer sequences. The earliest examples of a partial haplotype, TTCG, are found in Germany and China (Parts A and B). This form was present in Shanghai but never expanded (Part A). A cluster of infections that all carried this form were identified in Germany, but they did not expand and were subsequently replaced with the original Wuhan form, only to be replaced again by sequences that carry the full 4 base haplotype variant TTTG. The first example in our alignments (see Fig. S1 ) found to carry the full 4 base haplotype was sampled in Italy on Feb. 20. The first cases in Italy were of the original Wuhan form CCCA form, but by the end of February TTTG was the only form sampled in Italy, and it is the TTTG form that has come the dominate the pandemic. Of note, the TTCG form did not expand, and it lacked the RdRp P323L change, raising the possibility that the P323L change may contribute to a selective advantage of the haplotype. TTTG and CCCA are almost always linked in SARS-CoV-2 genomes, >99.9% of the time (Part B). The number of cases where the clade haplotype is disrupted, and the form of the disruption, are all noted in part B (not including ambiguous base calls). Some cases of a disrupted haplotype may be due to recombination events and not de novo mutations. Given the fact that these two forms are were co-circulating in many communities throughout the spring of 2020, and the fact that disruptions in the 4 base pattern are rare, suggests that recombination is overall relatively rare among pandemic sequences. Figure 7 . This tree the same as the tree shown in Fig. 6A , but highlights complementary information: the G614 substitution, and patterns of bases that underly the clades. It is based on the "FULL" alignment of 17,760 sequences, from the June 2 alignment, described in the pipeline in Fig. S1 , from the beginning of (orf1ab) to the last stop-codon (ORF10), NC_045512 bases 266-29,674). The outer element is a radial presentation of a full-codingregion parsimony tree; branches are colored by the global region of origin for each virus isolate. The inner element is a radial bar chart showing the identity of common mutations (any of the top 20 single-nucleotide mutations from the June 2 alignment), so that sectors of the tree containing a particular mutation at high frequency are subtended by an inner colored arc; mutations not in the top 20 are presented together in gray. The tree is rooted on a reference sequence derived from the original Wuhan isolates (GenBank accession number NC_045512), at the 3 o'clock position. Branch ends representing sequence isolates bearing the D614G change are decorated with a gray square; sectors of the tree containing that mutation are subtended by a dark blue arc in the inner element; other mutations are denoted by different colors. As an example, in this tree, the region from approximately 12:30 to 3 o'clock represents GISAID's "GR" clade, defined both by mutations we are tracking in this paper that carry the G614 variant (the GISAID G clade, defined by mutations A23403G, C14408T, C3037T, and a mutation in the 5' UTR (C241T, not shown here), and an additional 3-position polymorphism: G28881A + G28882A + G28883C. These base substitutions are contiguous and result two amino acid changes, including N-G204R, hence GISAID's "GR clade" name. Close examination of this triplet in sequences from the Sheffield dataset suggests the mutations are not a sequencing artifact. The outer phylogenetic tree was computed using oblong (see STAR Methods), and plotted with the APE package in R. The inner element is a bar chart plotted with polar coordinates using the gglot2 package in R. The frequency of the GR clade appears to be increased in the UK and Europe as a subset of the regional G clade expansion, given that both carry G614D. files from nanopore sequencing data of amplicons produced by the Artic network protocol. Raw data from amplicon 81 contains a portion of adapter sequence which is homologous to the reference genome, apart from the C variants which lead to a S943P mutation call. This region is therefore included in variant calling if location-based trimming is not carried out. Subsequent panels show that this region is soft clipped when trimming adapters and primers and is therefore not available for variant calling. B. Base frequencies at position 24389 in 23 samples from the Sheffield data show that C is present in half of the reads in the raw data, but is absent from trimmed and primer trimmed data. This figure is also associated with the Star Methods.

Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Bette Korber (btk@lanl.gov).

This study did not generate new unique reagents.

All sequence data used here are available from The Global Initiative for Sharing All Influenza Data (GISAID), at https://gisaid.org. The user agreement for GISAID does not permit redistribution of sequences.

Web-based tools to recreate much of the analyses provided in this paper but based on contemporary GISAID data downloads are available at cov.lanl.gov.

Code to create the alignments as described in Fig. S1 and to perform the Isotonic regression analysis in Fig. 3 

Human Subjects 999 individuals presenting with active COVID-19 disease were sampled for SARS CoV-2 sequencing at Sheffield Teaching Hospitals NHS Foundation Trust, UK using samples collected for routine clinical diagnostic use. This work was performed under approval by the Public Health England Research Ethics and Governance Group for the COVID-19 Genomics UK consortium (R&D NR0195). SARS-CoV-2 sequences were generated using samples taken for routine clinical diagnostic use from 999 individuals presenting with active COVID-19 disease: 593 female, 399 male, 6 no gender specified; ages 15-103 (median 55) years.

Samples for PCR detection of SARS-CoV-2 (Fig. 5A ) were all obtained from either throat or combined nose/throat swabs. Nucleic acid was extracted from 200µl of sample on MagnaPure96 extraction 23 platform (Roche Diagnostics Ltd, Burgess Hill, UK). SARS-CoV-2 RNA was detected using primers and probes targeting the E gene and the RdRp genes for routine clinical diagnostic purposes, with thermocycling and fluorescence detection on ABI Thermal Cycler (Applied Biosystems, Foster City, United States) using previously described primer and probe sets (Corman et al., 2020) . Nucleic acid from positive cases underwent long-read whole genome sequencing (Oxford Nanopore Technologies (ONT), Oxford, UK) using the ARTIC network protocol (accessed the 19 th of April, https://artic.network/ncov-2019.) Following base calling, data were demultiplexed using ONT Guppy using a high accuracy model. Reads were filtered based on quality and length (400 to 700bp), then mapped to the Wuhan reference genome and primer sites trimmed. Reads were then downsampled to 200x coverage in each direction. Variants were called using nanopolish (https://github.com/jts/nanopolish) and used to determine changes from the reference. Consensus sequences were constructed using reference and variants called.

Plasmids for full-length SARS-Cov-2 Spike were generated from synthetic codon-optimized DNA (Wuhan-Hu-1 isolate, GenBank: MN908947.3) through sub-cloning into the pHCMV3 expression vector, with a stop codon included prior to the HA tag. The D614G variant was generated by sitedirected mutagenesis. Positive clones were fully sequenced to ensure that no additional mutations were introduced.

Lentiviruses for stable cell line production were generated by seeding 293T cells at a density of 1x10 6 cells/well in a 6-well dish. Once the cells reached confluency, they were transfected with 2ug pCaggs-VSV-G, 2ug of lentiviral packaging vector pSPAX2, and 2ug of lentiviral expression plasmid pCW62 encoding ACE2-V5 and the puromycin resistance gene (pCW62-ACE2.V5-PuroR) or TMPRSS2-FLAG and the blasticidin resistance gene (pCW62-TMPRSS2.FLAG-BlastR) using Trans-IT transfection reagent according to manufacturer's instructions. 24 hours post-transfection, media was replaced with fresh DMEM containing 10% FBS and 20mM HEPES. 48 hours post-transfection, supernatants were collected and filtered using a 0.45um syringe filter (VWR Catalog #28200-026).

293T-ACE2 cells were generated by seeding 293T cells at a density of 1x10 6 cells/well in a 6-well dish. At confluency, cells were transduced with 100uL of ACE2.V5-PuroR lentivirus. 48 hours posttransduction, cells were placed under 5ug/ml puromycin. 293T-ACE2+TMPRSS2 cells were generated by seeding 293T-ACE2 cells at a density of 1x10 6 cells/well in a 6-well dish. At confluency, cells were transduced with 100uL of TMPRSS2.FLAG-BlastR lentivirus. 48 hours post-transduction, cells were placed under 10ug/ml blasticidin selection.

Recombinant SARS-CoV-2-pseduotyped VSV-∆G-GFP were generated by transfecting 293T cells with phCMV3 expressing the indicated version of codon-optimized SARS-CoV-2 Spike using TransIT according to the manufacturer's instructions. At 24 hr post-transfection, the medium was removed, and cells were infected with rVSV-G pseudotyped ∆G-GFP parent virus (VSV-G*∆G-GFP) at MOI = 2 for 2 hours with rocking. The virus was then removed, and the cells were washed twice with OPTI-MEM containing 2% FBS (OPTI-2) before fresh OPTI-2 was added. Supernatants containing rVSV-SARS-2 were removed 24 hours post-infection and clarified by centrifugation. Viral titrations were performed by seeding cells in 96-well plates at a density sufficient to produce a monolayer at the time of infection. Then, 10-fold serial dilutions of pseudovirus were made and added to cells in triplicate wells. Infection was allowed to proceed for 12-16 hr at 37 ˚C. The cells were then fixed with 4% PFA, washed two times with 1xPBS and stained with Hoescht (1ug/mL in PBS). After two additional washes with PBS, pseudovirus titers were quantified as the number of fluorescent forming units (ffu/mL) using a CellInsight CX5 imager (ThermoScientific) and automated enumeration of cells expressing GFP.

Additional assessments of corresponding D614 and G614 Spike pseudotyped viruses were performed by using lentiviral vectors and infection in 293T/ACE2.MF and TZM-bl/ACE2.MF cells (both cell lines kindly provided by Drs. Mike Farzan and Huihui Mu at Scripps). Cells were maintained in DMEM containing 10% FBS, 1% Pen Strep and 3 ug/ml puromycin. An expression plasmid encoding codon-optimized full-length spike of the Wuhan-1 strain (VRC7480), was provided by Drs. Barney Graham and Kizzmekia Corbett at the Vaccine Research Center, National Institutes of Health (USA). The D614G amino acid change was introduced into VRC7480 by site-directed mutagenesis using the QuikChange Lightning Site-Directed Mutagenesis Kit from Agilent Technologies (Catalog # 210518). The mutation was confirmed by full-length spike gene sequencing. Pseudovirions were produced in HEK 293T/17 cells (ATCC cat. no. CRL-11268) by transfection using Fugene 6 (Promega Cat#E2692). Pseudovirions for 293T/ACE2 infection were produced by co-transfection with a lentiviral backbone (pCMV ∆R8.2) and firefly luciferase reporter gene (pHR' CMV Luc) (Naldini et al., 1996) . Pseudovirions for TZM-bl/ACE2 infection were produced by co-transfection with the Envdeficient lentiviral backbone pSG3∆Env (kindly provided by Drs Beatrice Hahn and Feng Gao). Culture supernatants from transfections were clarified of cells by low-speed centrifugation and filtration (0.45 µm filter) and used immediately for infection in 96-well culture plates. 293T/ACE2.MF cells were preseeded at 5,000 cells per well in 96-well black/white culture plates (Perkin-Elmer Catalog # 6005060) one day prior to infection. Sixteen wells were inoculated with 50 ul of a 1:10dilution of each pseudovirus and incubated for three days. Luminescence was measured using the Promega Luciferase Assay System (Catalog # E1501). For infection of TZM-bl/ACE2.MF cells, 10,000 freshly trypsinized cells were added to 16 wells of a 96-well clear culture plate (Fisher Scientific) and inoculated with undiluted pseudovirus. Luminescence was measured after 2 days in a solid black plate using the Britelite Plus Reporter Gene Assay System (Perkin-Elmer). Luminescence in both assays was measured using a PerkinElmer Life Sciences, Model Victor2 luminometer. HIV-1 p24 content (produced by the backbone vectors) was quantified using the Alliance p24 ELISA Kit (PerkinElmer Health Sciences, Cat# NEK050B001KT). Reported relative luminescence units (RLUs) were adjusted for p24 content.

Pre-titrated amounts of rVSV-SARS-CoV-2 (D614 or G614 variant) were incubated with serially diluted human sera at 37 °C for 1 hr before addition to confluent Vero monolayers in 96-well plates. Infection proceeded for 12-16 hrs at 37 °C in 5% CO 2 before cells were fixed in 4% paraformaldehyde and stained with 1ug/mL Hoescht. Cells were imaged using a CellInsight CX5 imager and infection was quantitated by automated enumeration of total cells and those expressing GFP. Infection was normalized to the percent cells infected with rVSV-SARS-CoV-2 incubated with normal human sera. Data are presented as the relative neutralization for each concentration of sera.

The Global Initiative for Sharing All Influenza Data (GISAID) (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) has been coordinating SARS-CoV-2 genome sequence submissions and making data available for download since early in the pandemic. At time of this writing, hundreds of sequences were being added every day. These sequences result from extraordinary efforts by a wide variety of institutions and individuals: while an invaluable resource, but are mixed in quality. The complete sequence download includes a large number of partial sequences, with variable coverage, and extensive 'N' runs in many sequences. To assemble a high-quality dataset for mutational analysis, we constructed a data pipeline using some off-the-shelf bioinformatic tools and a small amount of custom code.

From theSARS-CoV-2 sequences available from GISAID, we derived a "clean" codon-aligned dataset comprising near-complete viral genomes, without large insertions or deletions ("indels") or runs of undetermined or ambiguous bases. For convenience in mutation assessment, we generated a codonbased nucleotide multiple sequence alignment, and extracted translations of each reading frame, from which we generated lists of mutations. The cleaning process was in general a process of deletion, with alignment of retained sequences; the following criteria were used to exclude sequences:

1.

Fragmented matching (> 20 nt gap in match to reference) 2.

Gaps at 5' or 3' end (> 3 nt) 3.

High numbers of mismatched nucleotides (> 20), 'N' or other ambiguous IUPAC codes.

Regions with concentrated ambiguity calls: >10 in any 50 nt window)

Any sequence matching any of the above criteria was excluded in its entirety.

Sequences were mapped to a reference (bases 266:29674 of GenBank entry NC_045512; i.e., the first base of the ORF1ab start codon to the last base of the ORF10 stop codon) using "nucmer" from the MUMmer package (version 3.23; (Kurtz et al., 2004) ). The nucmer output "delta" file was parsed directly using custom Perl code to partition sequences into the various exclusion categories (Sequence Mapping Table) and to construct a multiple sequence alignment (MSA). The MSA was refined using code derived from the Los Alamos HIV database "Gene Cutter" tool code base. At this stage, alignment columns comprising an insertion of a single "N" in a single sequence (generating a frame-shift) were deleted, and gaps were shifted to conform with codon boundaries.

Using the initial "good-sequence" alignment, a low-effort parsimony tree was constructed. Initially, trees were built using PAUP* (Swofford, 2003) with a single replicate heuristic search using stepwise random sequence addition; subsequently, a parsimony ratchet was added; currently, oblong (Goloboff, 2014) is used. Sequences in the alignment were sorted vertically to correspond to the (ladderized) tree, and reference-sequence reading frames were added. See Fig. S1 for a pipeline schematic.

Alignments were made and trees inferred for three distinct data partitions, the longer the alignments, the fewer sequences the sequences (Fig. S1 .) The full genome tree was used for Fig. 7 . Trees were inferred by either of two methods: 1. neighbor-joining using a p-distance criterion, (Swofford, 2003) or 2. parsimony heuristic search using a version of the parsimony ratchet (Goloboff, 2014) , the general conclusions in Fig. 7 were substantiated in both; the parsimony tree is shown.

The Covid-19 pie chart map is generated by overlaying Leaflet (a JavaScript library for interactive maps) pie charts on maps provided by OpenStreetMap. The interface is presented using rocker/shiny, a Docker for Shiny Server.

To observe a significant change in the frequency of two SARS-CoV-2 variants in a geographic region, three minimal requirements must be met. Both variants must have been introduced into an area and be co-circulating, data must be sampled for a long enough period to observe a change in frequency, and there must be enough data to be powered adequately to detect a difference.

We use the bioinformatic approaches described above to extract from GISAID all the politically defined geographic regions within the data that met these criteria, to track changes in frequency in a systematic way using all available data. The political/geographical regions we use are strictly hierarchically segmented based on the naming conventions used in GISAID. GISAID data is labeled such that the geographic source is noted first as a continent or Oceania; we call this Level 1. Level 2 is the country of origin of a sample. Level 3 are subcountries and states, and although occasionally level 3 includes a major city in a small country. For this purpose, England, Scotland, and Wales are considered sub-countries of the United Kingdom, and assigned level 3; the sampling in the UK has been the most extensive globally to date. Level 4, is the county or city of origin. The levels are strictly hierarchical, and within a given level, the geographical regions do not overlap. In some cases (e.g., Nepal_Kathmandu and Nepal, Greece_Athens and Greece, Italy_Veneto_Verona and Italy_Veneto, or Iceland_Reykjavik and Iceland) the sampling in a sub-level exactly matches the sampling in the corresponding upper level, in which case the sub-level is not presented. Levels 3 and 4 are not always available, and the day of sampling is also not always available.

The statistical strategies we use are then applied separately in each country, region or city, and we do not assume that outbreaks in each political subdivision are independent and identically distributed. Instead, our model assumption is that the individuals we test within a region are independent. This assumption may fail if there are sampling biases in a region that change over a given period of time. The G614 form is part of the G clade haplotype that is introduced by travelers, as we discuss in the text, and it is rare for it to arise independently. Our null hypothesis is that the observed shifts in frequency are random nondirectional drift. We have taken two statistical approaches to test this.

For this comparison, we used a two-sided Fisher's exact test to compare the G614 and D614 counts in the pre-onset and the post-delay periods, as described in the text, and provides a p-value against the null hypothesis that the fraction of D614 and G614 sequences did not change. To be included in the analysis, 15 sequences were required pre-onset, with a mixture of D614 and G614 present such that the rarer form was present at least 3 times; we also required a minimum of 15 sequences be sample at least 2 weeks later, to create a post-delay set. Only regions for which p<0.05 are considered, based on a two sided-test. We then use a binomial test to evaluate the null hypothesis that in regions where we saw significant change in sampling frequency over time, the shift was as likely to be an increase or a decrease in G614 across geographic regions. This analysis is presented in Fig. 1B .

Isotonic regression forms the basis of a one-sided test of the hypothesis for positive selection based on fitting the indicator that the typed strain is G as a logistic regression in which the logarithm of the odds ratio is a non-decreasing function of time. We use the residual deviance of the fitted model as our test statistics. To be included in this analysis, a region was required to have at least 5 sequences each of D614 and G614, and a minimum of 14 sampling days of data available. While we have a composite null hypothesis (the log-odds ratio is non-increasing), assuming that the log-odds ratio remains constant over time leads to tests that have largest power. While the classical chi-square approximation does not hold, we can sample from the constant log-odds ratio by permuting the vector of variant labels, and refitting the isotonic logistic regression. We performed 400 randomizations of the data in each region. Hence the lowest p-value we can obtain is 0.0025. The reverse hypothesis, namely than the fraction of G variant decreases with time is also tested by fitting a non-increasing function of time. The isotonic logistic regression was done using R and the cgam package. We applied the bionomial test across regions with a significant change in one direction, as we did for the Fishers test results. This analysis is presented in Fig 3 and 

Two PCR Ct methods were used as a surrogate for estimating in vivo viral load in the upper respiratory tract, switching methods in mid-April due a shortage of kits. The first method involved nucleic acid extraction; the second method, heat treatment (Fomsgaard and Rosenstierne, 2020) .

To assess the impact of available clinical parameters on viral load as measured by PCR Cts, we used a linear model, predicting Ct from PCR method, Sex, Age and D614G variant. This revealed that only the PCR method and the D614G variant were statistically significant. A negative coefficient for the G variant indicated that patients infected by the latter have, on average, a higher viral load, but that that viral load is not impacted by neither age nor sex.

The results from the smaller model are: Results comparing D614G status for the two methods were also evaluated independently, and the first method showed a significant association between lower Ct values and presence of G614 (Wilcoxon p = 0.033), but the second method, with many fewer samples, did not reach significance.

The simple Fisher's exact test analysis in Fig. 5 indicates that the D614G status is not predictive of hospitalization, even though it is predictive of viral load. We can make a first analysis to predict hospitalization from viral load, gender, age and D614G status: As somewhat expected, the D614G status is not statistically significant, even though viral load is, but the coefficient goes in the opposite direction than we would have intuited: a lower viral load is predictive of a higher probability of hospitalization. Sex (Male) and Age both increase the probability of hospitalization.

Predicting Hospitalization, revisited

Although the above analysis indicates that aa614G does not predict hospitalization directly, it does predict viral load and viral load predicts hospitalization; so there is a concern that aa614G might affect hospitalization, but that this effect is "masked" by the viral load. To explore this hypothesis, we "unmask" the aa614G by using the residuals from the regression of Ct on extraction method and D614G status to get a second predictive model for hospitalization: In these regression analyses, the estimated coefficients for age, sex and viral load (corrected or not for method and strain) remain mostly unchanged, and strain still does not have an effect.

All other comparisons were not significant. All coding was done using R. Results of these analysis are presented in the main text and in Fig. 5 .

We used a log-normal generalized linear model (GLM) to test whether the G614 variant grew to higher titers than the wildtype D614 virus in Vero, 293T-ACE2 and 293T-ACE2-TMPRSS2 cell lines. The full experiment was repeated twice, each time in triplicate, and the 2 experimental repeats were considered random effects. Viral variant and cell line were considered as fixed effects. On average, across all cell lines, G614 grows to about a 3-fold (2.95) higher titer than D614 (p=9x10 -11 ). A significant interaction was found between viral variant and cell line (p=0.002), indicating that the relative increase of G614 compared to D614 was significantly different across cell lines (p=0.002).

Results of these analysis are presented in Fig. 6A .

We discovered a sequencing processing error that gave rise to what appeared at first to be a mutation of interest at position 943 (24389 A>C and 24390 C>G) in Spike that was evident in sequences from Belgium. It was frequent enough to be a site of interest, and was tracked. We contacted the group in Belgium, the source of the data, who were already aware of the issue, concurred with our interpretation, and they had been in touch with GISAID with a request to remove the problematic sequences.

We identified the issue with this site as part of another study using a method to detect systematic sequencing errors (Freeman et al., 2020) ; we are interrogating the quality of available sequencing data and these positions were highlighted as suspect. We interrogated these positions in the raw sequencing data from Sheffield, and although these two variants are not present in the final consensus sequence from any of the Sheffield isolates, the raw, untrimmed bam files show their presence in only one of the amplicons covering the site (Fig. S7 A&B) . We noticed that in fact this position is to the left of the 5' primer of amplicon 81 in what we believe to be an adapter sequence. Comparison of the Wuhan reference and the adapter sequence reveals similarity around this position:

Nanopore adapter sequence: CAGCACCTT

The Wuhan reference sequence: CAGCAAGTT In our validation set, we see a C present at around 50% of called bases at both these positions in raw data but this region is trimmed by the ARTIC pipeline and is therefore not used to call variants and contribute to the final consensus sequence. Although it is evident in amplicon 81, in this region, there is no evidence for these variants in the data from amplicon 80, which also covers these positions. We include a figure (Fig. S7 ) to explain our finding.

In summary this is an error that has arisen due to a combination of improper trimming of adapter and primer regions from raw sequencing reads before downstream analysis, and the coincidental homology between the nanopore adapter sequence and the Wuhan reference genome in this region. This is included here as a cautionary note; resolving rare biological mutations and sequencing error will be an important balance going forward in terms of interpretation of rare mutations (De Maio et al., 2020) . A recurrent amino acid change like L5F (Fig. 7) could potentially result from a recurrent sequencing or sequence processing error (De Maio et al., 2020) , or alternatively, it may be of particular interest if it is naturally recurring homoplasy.

Current data updates, analytical results, and webtools: cov.lanl.gov Table S1 . GISAID acknowledgments, Related to Figures 1-3, and 7 . Table S2 can be found at https://cov.lanl.gov.

Data S1. Modeling the daily fraction of the G614 variant as a function of time in local regions using isotonic regression, Related to Figure 3.

-A SARS-CoV-2 variant with Spike G614 has replaced D614 as the dominant pandemic form -The consistent increase of G614 at regional levels may indicate a fitness advantage -G614 is associated with lower RT PCR Ct's, suggestive of higher viral loads in patients -The G614 variant grows to higher titers as pseudotyped virions 

Notes: 1. Multiply occurring identical sequences are reduced to 2 occurrences to so that parsimony-informative sites do not become unique. 2. "Coding regions" subset includes sequences passing error filtering, bounded from the orf1ab start codon to the ORF10 stop codon (NC_045512 genome positions 266-29674). 

A signature in HIV-1 envelope leader peptide associated with transition from acute to chronic infection impacts envelope processing and infectivity

SARS-CoV-2 viral spike G614 mutation exhibits higher case fatality rate

Epidemic dynamics and antigenic evolution in a single season of influenza A

HIV-1 Neutralizing Antibody Signatures and Application to Epitope-Targeted Vaccine Design

Potential for developing a SARS-CoV receptorbinding domain (RBD) recombinant protein as a heterologous human vaccine against coronavirus infectious disease (COVID)-19

Analysis of human coronavirus 229E spike and nucleoprotein genes demonstrates genetic drift between chronologically distinct strains

COVID-19 shot protects monkeys

Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR

HomoplasyFinder: a simple tool to identify homoplasies on a phylogeny

Origin and evolution of pathogenic coronaviruses

Motifs of serine and threonine can drive association of transmembrane helices

SARS and MERS: recent insights into emerging coronaviruses

Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California

Demographic science aids in understanding the spread and fatality rates of COVID-19

Data, disease and diplomacy: GISAID's innovative contribution to global health

Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States

An alternative workflow for molecular detection of SARS-CoV-2 -escape from the NA extraction kit-shortage

Genomic loci susceptible to systematic sequencing bias in clinical whole genomes

Cooperation of B cell lineages in induction of HIV-1-broadly neutralizing antibodies

Oblong, a program to analyse phylogenomic data sets with millions of characters, requiring negligible amounts of RAM

The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2

Recombination, Reservoirs, and the Modular Spike: Mechanisms of Coronavirus Cross-Species Transmission

Isolation and Characterization of Viruses Related to the SARS Coronavirus from Animals in Southern China

Spread of SARS-CoV-2 in the Icelandic Population

Nextstrain: real-time tracking of pathogen evolution

Neutralizing antibody response and SARS severity

SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor

SARS-CoV-2 Cell Entry Depends on ACE2 and TMPRSS2 and Is Blocked by a Clinically Proven Protease Inhibitor

SARS-CoV-2 Reverse Genetics Reveals a Variable Infection Gradient in the Respiratory Tract

Matplotlib: A 2D Graphics Environment

Anti-severe acute respiratory syndrome coronavirus spike antibodies trigger infection of human immune cells via a pH-and cysteine protease-independent FcγR pathway

Functional analysis of potential cleavage sites in the MERS-coronavirus spike protein

Versatile and open software for comparing large genomes

Neutralization-guided design of HIV-1 envelope trimers with high affinity for the unmutated common ancestor of CH235 lineage CD4bs broadly neutralizing antibodies

Envelope residue 375 substitutions in simian-human immunodeficiency viruses enhance CD4 binding and replication in rhesus macaques

Two-year prospective study of the humoral immune response of patients with severe acute respiratory syndrome

Clinical and biochemical indexes from 2019-nCoV infected patients linked to viral loads and lung injury

A Unique Clade of SARS-CoV-2 Viruses is Associated with Lower Viral Loads in Patient Upper Airways. medRxiv

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

Coronavirus disease (COVID-19): a scoping review

Enhanced isolation of SARS-CoV-2 by TMPRSS2-expressing cells

Host cell entry of Middle East respiratory syndrome coronavirus after two-step, furin-mediated activation of the spike protein

Efficient transfer, integration, and sustained long-term expression of the transgene in adult rat brains injected with a lentiviral vector

Proteolytic processing of Middle East respiratory syndrome coronavirus spikes expands virus tropism

A geroscience perspective on COVID-19 mortality

Genetic drift of human coronavirus OC43 spike gene during adaptive evolution

Emergence of gp120 V3 variants confers neutralization resistance in an R5 simian-human immunodeficiency virus-infected macaque elite neutralizer that targets the N332 glycan of the human immunodeficiency virus type 1 envelope glycoprotein

Characterizing the Murine Leukemia Virus Envelope Glycoprotein Membrane-Spanning Domain for Its Roles in Interface Alignment and Fusogenicity

Maximum Daily Temperature, Precipitation, Ultra-Violet Light and Rates of Transmission of SARS-Cov-2 in the United States

Insights into RNA synthesis, capping, and proofreading mechanisms of SARS-coronavirus

GISAID: Global initiative on sharing all influenza data -from vision to reality

A transmembrane serine protease is linked to the severe acute respiratory syndrome coronavirus receptor and activates virus entry

Coronaviruses lacking exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and potential therapeutics

Cross-host evolution of severe acute respiratory syndrome coronavirus in palm civet and human

Broadening of neutralization activity to directly block a dominant antibodydriven SARS-coronavirus evolution pathway

PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods), Version 4

Identification of human neutralizing antibodies against MERS-CoV and their role in virus adaptive evolution

Longitudinally profiling neutralizing antibody response to SARS coronavirus with pseudotypes

Human monoclonal antibody combination against SARS coronavirus: synergy and coverage of escape mutants

Circulation of genetically distinct contemporary human coronavirus OC43 strains

Comparing viral load and clinical outcomes in Washington State across D614G mutation

Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein

Molecular Mechanism for Antibody-Dependent Enhancement of Coronavirus Entry

The establishment of reference sequence for SARS-CoV-2 and variation analysis

Epitopes in Humans Elicited both Enhancing and Neutralizing Effects on Infection in Non-human Primates

Antibody-dependent SARS coronavirus infection is mediated by antibodies against spike proteins

Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation

Genome Composition and Divergence of the Novel Coronavirus (2019-nCoV) Originating in China

Inhibition of SARS-CoV-2 (previously 2019-nCoV) infection by a highly potent pan-coronavirus fusion inhibitor targeting its spike protein that harbors a high capacity to mediate membrane fusion

Structural basis for the recognition of SARS-CoV-2 by full-length human ACE2

Antibody-dependent enhancement of SARS coronavirus infection and its role in the pathogenesis of SARS

DNA vaccine protection against SARS-CoV-2 in rhesus macaques

A highly conserved cryptic epitope in the receptor-binding domains of SARS-CoV-2 and SARS-CoV

TMPRSS2 and TMPRSS4 promote SARS-CoV-2 infection of human small intestinal enterocytes

The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity. bioRxiv

Antibody responses against SARS coronavirus are correlated with disease outcome of infected individuals

A pneumonia outbreak associated with a new coronavirus of probable bat origin

We thank Andrew McMichael, Sarah Rowland-Jones, and Xiao-Ning Xu for bringing together the clinical and theory teams. We thank Anthony West for pointing out the 5' UTR G-clade mutation, George Ellison for suggestions on clinical data analyses; Barbara Imperiali for insights regarding the structural implications of the D614G change; and Rachael Mansbach, Srirupa Chakraborty and Kien