key: cord-0993911-5gilxw36
authors: Wei, Yulong; Silke, Jordan R.; Aris, Parisa; Xia, Xuhua
title: Coronavirus genomes carry the signatures of their habitats
date: 2020-12-22
journal: PLoS One
DOI: 10.1371/journal.pone.0244025
sha: 0f89213794f5052b6b533cd5e669bc736e764ca7
doc_id: 993911
cord_uid: 5gilxw36

Coronaviruses such as SARS-CoV-2 regularly infect host tissues that express antiviral proteins (AVPs) in abundance. Understanding how they evolve to adapt or evade host immune responses is important in the effort to control the spread of infection. Two AVPs that may shape viral genomes are the zinc finger antiviral protein (ZAP) and the apolipoprotein B mRNA editing enzyme-catalytic polypeptide-like 3 (APOBEC3). The former binds to CpG dinucleotides to facilitate the degradation of viral transcripts while the latter frequently deaminates C into U residues which could generate notable viral sequence variations. We tested the hypothesis that both APOBEC3 and ZAP impose selective pressures that shape the genome of an infecting coronavirus. Our investigation considered a comprehensive number of publicly available genomes for seven coronaviruses (SARS-CoV-2, SARS-CoV, and MERS infecting Homo sapiens, Bovine CoV infecting Bos taurus, MHV infecting Mus musculus, HEV infecting Sus scrofa, and CRCoV infecting Canis lupus familiaris). We show that coronaviruses that regularly infect tissues with abundant AVPs have CpG-deficient and U-rich genomes; whereas those that do not infect tissues with abundant AVPs do not share these sequence hallmarks. Among the coronaviruses surveyed herein, CpG is most deficient in SARS-CoV-2 and a temporal analysis showed a marked increase in C to U mutations over four months of SARS-CoV-2 genome evolution. Furthermore, the preferred motifs in which these C to U mutations occur are the same as those subjected to APOBEC3 editing in HIV-1. These results suggest that both ZAP and APOBEC3 shape the SARS-CoV-2 genome: ZAP imposes a strong CpG avoidance, and APOBEC3 constantly edits C to U. Evolutionary pressures exerted by host immune systems onto viral genomes may motivate novel strategies for SARS-CoV-2 vaccine development.

The COVID-19 pandemic is a serious global health emergency. Understanding how coronaviruses adapt or evade tissue-specific host immune responses is important in the effort to control the spread of infection and to facilitate vaccine-development strategies. As obligate parasites, coronaviruses evolve in mammalian hosts and carry genomic signatures shaped by their host- unclear whether APOBEC3 expression in tissue cells plays a role in restricting viral infections, APOBEC3 expression has been observed to vary at the tissue level [21, 22] .

APOBEC3 enzymes have negligible catalytic activity prior to infection, but expressions of associated mRNAs are induced in response to cell stresses including hypoxia, cell crowding, and presence of interferon-α [21, 25, 26] . Through a mechanism largely derived from HIV-1 studies, APOBEC3 enzymes have been prominently reported to restrict viral infectivity [20, [27] [28] [29] [30] by editing C to T at the viral genomes, and HIV-1 avoids this deleterious effect by expressing Vif to target and degrade APOBEC3 enzymes [31, 32] . Additionally, C to U RNA-editing has been demonstrated in lymphocytes, macrophages, monocytes, and natural killer cells by both A3A and A3G in response to hypoxia and interferons [25, 26, 33] , and recent studies propose that APOBEC3 enzymes may act directly to edit single-stranded RNA coronaviruses [34] [35] [36] . This notion is supported by lines of evidence showing that A3C, A3F, and A3H may inhibit HCoV-NL63 coronavirus infection in humans [37] . Indeed, many C to U mutations have been detected in SARS-CoV-2 genomes [34] [35] [36] 38] , and if RNA-editing by APOBEC3 is involved, then this immune response could potentially restrict SARS-CoV-2 because coronaviruses do not encode a Vif analogue to degrade APOBEC3 enzymes.

Nonetheless, APOBEC3 enzymes prefer to edit C in specific motif contexts. For instance, in HIV-1 [39, 40] , MLV [41] [42] [43] , and SIV [44] , A3G tends to deaminate C mostly in the context of 5' CC (underlined is site subjected to C to T editing), whereas all other APOBEC3 paralogues deaminate C in the context of 5' TC [40, [45] [46] [47] . In HIV, these edits cause 5' GG to 5' AG and 5' GA to 5' AA hypermutations on the positive DNA strand [39, 40] to potentially disrupt protein function [48, 49] . However, not all 5' TC and 5' CC are deaminated with equal efficiency because the identities of the -2 and +1 nucleotides flanking the 5'NC are important in APOBEC3 target selection [41, 50-53].

In addition to motif preference, the structural configuration of the substrates bound to the APOBEC3 zinc center may also influence APOBEC3 editing activity [54, 55] . Adding to this complexity, APOBEC3-mediated editing studies have reached dissimilar conclusions as to the optimal secondary structure of the 5' TC target. For instance, a large number of A3A and A3G RNA editing substrates were predicted to form a loop structure in innate immune cells and HEK293T cells [25, 33] . In a further mutagenesis study on three A3A editing RNA substrates, SDHB, APP, and TMEM109, and on one A3G editing RNA substrate PRPSAP2, Sharma and Baysal [56] found that both A3A and A3G enzymes highly preferred to edit the respective 5' TC and 5' CC targets that resided within a 4nt-loop in a stem-loop structure, with C located at the 3' end of the loop followed by a +1G located at the 5' end of the stem. In this structural context, changing the substrate at the -1 and +1 nucleotides greatly reduced A3A and A3G editing activities [56] . Nonetheless, a limitation of the study is that only four APOBEC3 editing substrates were examined and all were in the context of 5' N(C/T)CG.

Another study [57] showed that A3G could also efficiently edit 5' ACCA, 5' CCCC, and 5' TCCT, but not 5' GCCG, in ssDNA oligonucleotides when these targets were in an open (unstructured) configuration. Furthermore, A3G poorly edits 5' ACCA and 5' CCCC targets when they are located in short loops (<7 nt and <6 nt, respectively). Summarily, A3G prefers to edit 5' CC targets within a loop region but only in the context of 5' NCCG.

A third study [52] analyzed the in vitro editing activities of all seven APOBEC3 enzymes on ssDNA oligonucleotides embedding 5' NTCN motifs, where the 5' TC targets were located within loop, stem, or open structures. McDaniel et al. [52] found that the editing activities of all APOBEC3 enzymes, except A3F and A3H, were the highest when 5' TC in the context of 5' GTCG was located within a loop region. However, all seven APOBEC3 enzymes also had the lowest editing activities at 5' GTCG in comparison to other 5' NTCN motifs, and the editing activities of all APOBEC3 enzymes were the highest when 5' ATCA, 5' GTCA, 5' CTCA, and 5' CTCT were located in an open structure. In general, APOBEC3-edited 5' (C/T)C targets in the context of 5' N(C/T)CG prefer a loop region, but 5' TC targets in the context of 5' NTC (A/T) prefer an open structure.

The above observations allow for the formation of the hypothesis that APOBEC3 and ZAP exert selective pressure on coronavirus genomes. To test this, a variety of mammalian hosts and tissues should be considered because there may be both species-specific and tissue-specific differences in ZAP and APOBEC3 productions. If the short-lived mice should produce less ZAP and APOBEC3 than long-living mammals, then the mouse hepatitis virus (MHV) should also experience little selection targeting CpG by ZAP and C to U editing by APOBEC3 in comparison to other mammalian-specific coronaviruses. Moreover, coronaviruses regularly infect organ tissues exposed to the external environments such as the respiratory and digestive systems [58, 59] . We expect that if a coronavirus regularly infects host tissues that are abundant in ZAP, then its genome should display CpG deficiency in CG-rich motifs such as CN X GNCG, to elude ZAP-mediated immune response. If, in addition, the regularly infected tissue is abundant in APOBEC3, then the viral genome should trend prevalently towards increased C to U mutations in the context of APOBEC3-preferred motifs. Conversely, if a species-specific coronavirus regularly infects host tissues that are deficient in either APOBEC3 or ZAP, there will be either no strong CpG deficiency or elevated U and decreased C contents, as these selective pressures will be weak.

Our investigation considered a comprehensive number of publicly available genomes for seven coronaviruses (the Betacoronaviruses: SARS-CoV-2, SARS-CoV, MERS, Bovine CoV, mouse hepatitis virus [MHV] , and porcine hepatitis E virus [HEV] , and the Alphacoronavirus canine respiratory coronavirus [CRCoV]) as well as studies with tissue-level ZAP and APO-BEC3 mRNA expressions in the five host species (human, cattle, dog, mice, and pig). We found that all surveyed coronaviruses, except MHV, regularly infect host tissues with high ZAP and APOBEC3 mRNA expressions. Expectedly, all surveyed coronavirus genomes except MHV are strongly CpG deficient. In addition, deficiency of CpG was detected in the context of ZAP-preferred motifs in SARS-CoV-2. Furthermore, a temporal and geographical analysis for single nucleotide polymorphisms (SNPs) in local SARS-CoV-2 regions showed that the occurrence of C to U mutations was strikingly more prevalent than other SNPs. The preferred motif and structural contexts of 5' UC to 5' UU mutations were consistent with those favorably edited by APOBEC3 enzymes, but 5' CC to 5' CU mutations were weakly explained by APO-BEC3G editing preference. The genome compositions of viruses are subjected to adaptation when the virus regularly infects tissues expressing ZAP and APOBEC3 in abundance, but not when a virus infects tissues that lowly express these AVPs.

The NCBI Nucleotide Database was queried for "APOBEC3" and "ZC3HAV1" as gene names, and "Homo sapiens", "Bos taurus", "Canis lupus familiaris", "Mus musculus", and "Sus scrofa" as species, and protein coding sequences of APOBEC3 and ZC3HAV1 isoforms were extracted in FASTA format along with their Ensembl Accession IDs.

To compare mRNA expressions of APOBEC3 and ZC3HAV1 (ZAP) among tissues, we retrieved publicly available RNA Sequencing and Microarray studies that each sampled total RNA in at least ten mammalian tissues (see S1 File). The five mammalian species that have extensive tissue-specific mRNA expressions are Homo sapiens (human), Bos Taurus (cattle), Canis lupus familiaris (dog), Mus musculus (mice), and Sus scrofa (pig). For Homo sapiens, tissue-level mRNA expressions were retrieved in averaged FPKM values from all 171 RNA-Seq datasets in BioProject PRJEB4337 [3] , 48 RNA-Seq datasets in BioProject PRJEB2445, 20 RNA-Seq datasets in BioProject PRJNA280600 [60] , and in median TPM values from all RNA-Seq datasets available in the GTEx Portal [61] . For Mus musculus, tissue-level mRNA expressions were retrieved in averaged FPKM values from all 741 RNA-Seq datasets in BioProject PRJNA66167 (mouse ENCODE consortium) [62] and in average TPM values from all 79 RNA-Seq datasets in BioProject PRJNA516470 [63] . For Sus scrofa, tissue-level mRNA expressions were retrieved in averaged FPKM values from TISSUE 2.0 integrated datasets [64] . For Canis lupus familiaris, tissue-level mRNA expressions were retrieved in averaged fluorescence intensity units (FIU) from all 39 microarray datasets in BioProject PRJNA124245 [65] , and in averaged TPM values from all 75 RNA-Seq datasets in BioProject PRJNA516470 [63] . Lastly, for Bos taurus, tissue-level mRNA expressions were retrieved in averaged FPKM values from 42 RNA-Seq datasets in the Bovine Genome Database [66] . All selected studies have considered total RNA at the tissue level in healthy individuals, but they do not report cell-specific mRNA expressions within most tissues (e.g., lung, liver, small intestine).

Given that mRNA expressions were extracted were from multiple independent sources (some reporting FPKM and others TPM) and thus not directly comparable between studies, we calculated the relative mRNA expression levels of APOBEC3 and ZAP isoforms among tissues in each independent source. Specifically, we calculated the proportion of mRNA expression (PME) as: PME ¼ mRNA expression value in a specific tissue summed mRNA expression values in all tissues ð1Þ

To show that PME determines the relative mRNA expressions of a gene among tissues, we calculated the PME values for all 13 human genes that were determined to have the highest mRNA expressions in the lungs (marked as Tissue enriched) by The Human Protein Atlas database [67] (https://www.proteinatlas.org/humanproteome/tissue/lungs). They are SFTPC, SFTPA2, SFTPA1, SCGB1A1, SFTPB, AGER, SCGB3A2, SFTA2, CACNA2D2, LAMP3, SFTPD, HTR3C, RTKN2. If PME works as intended, then the PME values of these 13 genes should be high in the lung tissue in comparison to 52 other tissues reported in the GTEx database [61] . As expected, we found that 12 out of these 13 genes have the highest PME values in the lungs, but CACNA2D2 has the highest PME value in the cerebellum and second highest PME values in the lungs (see S1 File). This is not unexpected because based on the BRAIN ATLAS [67] , the mRNA expression of CACNA2D2 is also enhanced in the cerebellum. PME values were calculated from averaged TPM values in 24 human tissues using all RNA-Seq datasets available in the GTEx Portal [61] , from averaged FPKM values in 26 cattle tissues using the Bovine Genome Database [66] , from averaged FPKM values in 33 pig tissues using TISSUE 2.0 integrated datasets [64] , from averaged FPKM values in 17 mice tissues using all 741 RNA-Seq datasets in mouse ENCODE consortium [62] , from averaged FPKM values in 12 mice tissues using 79 RNA-Seq datasets in BioProject PRJNA516470 [63] , and from averaged fluorescence intensity units in 10 dog tissues using all 39 microarray datasets in BioProject PRJNA124245 [65] . For each AVP isoform, tissue-specific PMEs were designated as high if they are greater than averaged PME and low if they are less than averaged PME (see S1 File).

Processing and quantifying transcriptomic data from chimeric human lung-only mice to obtain AVP mRNA expressions in control vs.

Transcriptomic data associated with a study exploring SARS-CoV-2 infection in chimeric human lung-only mice (LoM) (GSE155286) were retrieved from NCBI's Sequence Read Archive (SRA) database and a summary of the data collected is detailed in Table 1 .

The data were first partitioned into gzipped forward and reverse read fastq files using fastqdump from the NCBI SRA toolkit (version 2.10.8). The resulting fastq.gz files were trimmed for Illumina TruSeq3 adapters and reads averaging a phred quality score < 20 were discarded using trimmomatic version 0.39 [68] . All surviving pairs from preprocessing were carried forward to quantification using kallisto (version 0.46.1) [69] .

An index file for the human transcriptome was generated from the Ensembl FASTA reference file "Homo_sapiens.GRCh38.cdna.all.fa" containing all human cDNAs with Ensembl transcript IDs [70] using kallisto's index function. The resulting index was used to quantify transcript abundances using kallisto for each experiment detailed in Table 1 , and 1000 bootstrap samples were computed for each experiment to act as a proxy for technical replicates during subsequent analysis using sleuth (version 0.30) [71] . The kallisto outputs, including bootstrapped values from the previous step, were processed using the sleuth R package. Ensembl transcript IDs were associated with their Ensembl Gene ID and gene name using the biomaRt R package and a sleuth object was prepared using the Ensembl gene ID for aggregation. The sleuth object was then fitted with two models: a full model (alternative) that assumes transcript abundance varies based on the time after SARS-CoV-2 infection, and a reduced model (null) assuming that transcript abundance varies between samples. The two models were compared with the likelihood-ratio test, and the resulting transcript level p-values were then aggregated based on their associated Ensembl gene ID using Lancaster's method, which assigns weights based on transcript abundance. A Benjamini-Hochberg false discovery rate correction [72] was then applied to the weighted p-values to account for multiple comparisons [73] . AVP genes were then differentially assessed at the level of their corresponding transcripts and comparisons were drawn from heat maps using natural log transformed TPM values with a 0.5 offset generated by sleuth. Only transcripts of interest with an Ensembl Biotype of "Protein coding" that demonstrated variations in expression levels were considered in subsequent analyses. All significantly differentially expressed genes between control and infected samples, with false discovery rate q < 0.05, are listed in S1 File.

Host tissues that are infected by SARS-CoV-2, SARS-CoV, and MERS in humans, Bovine CoV in cattle, CRCoV in dogs, MHV in mice, and HEV in pigs were identified through an exhaustive large-scale manual search for experimental evidence-based primary source studies published up until June 5, 2020. Only studies that showed results from clinical course, autopsy, and experimental infections were considered, but cross-host studies were excluded. In total, tissue infections were determined from 25 SARS-CoV studies, 11 SARS-CoV-2 studies, eight MERS studies, 15 mouse hepatitis virus (MHV) studies, nine porcine hepatitis E virus (HEV) studies, 18 canine respiratory coronavirus (CRCoV) studies, and ten bovine coronavirus (Bovine CoV) studies (see references in S1 File). Next, the regular tissue habitats of viruses were determined based on commonness of viral detection in host tissues when all studies were considered. For example, among the 25 SARS-CoV-2 studies collected, some tissue infections (e.g., lungs and intestines) are recorded in many studies while other tissue infections are rarely recorded (e.g., stomach). To score the commonness of SARS-CoV-2 infection in a tissue, in the lungs for instance, we calculated commonness of detection (COD) as:

Note that the COD measurement should not be used to make specific comparisons to rank most to least regularly infected tissues, because manually curated study size biases COD measurements. However, COD does tell us which tissues were commonly infected by a virus. For example, among the 25 SARS-CoV-2 studies collected, viral detection was reported in the lungs in nine studies, the intestines in eight studies, the liver in four studies, the heart in three studies, the kidney in three studies, and the stomach in one study (S1 File). The COD values for the lungs and the intestines are therefore the highest. Hence, the lungs and the intestines are surely regular habitats of SARS-CoV-2. Similarly, we determined the regular habitat for SARS-CoV (human lungs), MERS (human lungs), MHV (mice brain), HEV (pig liver), CRCoV (dog intestines and lungs), and Bovine CoV (cattle intestines and lungs). These regular habitats have COD values higher than twice that of any other tissue, except dog lungs for CRCoV, whose COD value was at least 1.5 times that of other tissues.

The genome, Accession ID, and Sample Collection Date of 28475 SARS-CoV-2 strains were retrieved from the China National Center for Bioinformation (CNCB) (https://bigd.big.ac.cn/ ncov/variation/statistics?lang=en, last accessed May 16, 2020), among which 2666 strains were selected because they were annotated as having complete genome sequences and high sequencing quality. Additionally, the complete genomic sequences of 403 MERS strains, 134 SARS-CoV strains, 20 Bovine CoV strains, two CRCoV strains, 26 MHV strains, and ten HEV strains were downloaded from the National Center for Biotechnology Information (NCBI) Nucleotide Database (https://www.ncbi.nlm.nih.gov/) (see S2 File).

We computed the nucleotide and di-nucleotide frequencies in each viral genome. Among strains, some have long poly-A tails that are missing in others. Some also have a longer 5' untranslated region (5' UTR) than others. To make a fair comparison between strains, genomes were first aligned with MAFFT version 7 [74] , with the slow but accurate G-INS-1 option for 134 SARS-CoV, 20 Bovine CoV, two CRCoV, 26 MHV, and ten HEV strains, and with the fast FFT-NS-2 option for large alignments for 2666 SARS-CoV-2 and 403 MERS strains. Next, using DAMBE version 7 [75] , the 5' UTR sequences were trimmed away until the first fully conserved nucleotide position, and the 3' UTR sequences were trimmed out up to the last fully conserved nucleotide position. Then, gaps were removed from each trimmed genome, and the global nucleotide and dinucleotide frequencies were computed in DAMBE under "Seq. Analysis|Nucelotide & di-nuc Frequency" (see S2 File). Additionally, nucleotide and di-nucleotide frequencies were similarly computed for whole, untrimmed, genomes (see S3 File). Finally, the conventional index of CpG deficiency (I CpG ) [76, 77] was calculated, using the formula below:

Where P CG is the proportion of CG dinucleotides when all dinucleotide frequencies were considered, and P C and P G are proportions of C and G nucleotides, respectively. The index is expected to be proximal to 1 when CpG is not deficient or in excess, smaller than 1 if CpG is deficient and greater than 1 if CpG is in excess.

Among the 2666 SARS-CoV-2 genomes from CNCB (database last accessed on May 16, 2020), we randomly selected one genome at each unique collection date, inclusively between December 31, 2019 (Wuhan-Hu-1, first isolate) and May 6, 2020 (mink/NED/NB04), among those that have complete records of local region annotations and nucleotide sequences in NCBI (see S4 File). A total of 99 strains were retrieved across 127 days since SARS-CoV-2 (including strain Wuhan-Hu-1, MN908947) was first sequenced. For each of these 99 strains, the nucleotide sequence of 12 out of 13 viral regions (5' UTR, ORF1ab, S, ORF3, E, M, ORF6, ORF7a, ORF8, N, ORF10, and 3' UTR) were extracted from DAMBE in FASTA format, MAFFT aligned with the slow but accurate G-INS-1 option, and local nucleotide and dinucleotide frequencies were computed for each region (see S5 File). ORF7b was omitted from the analysis because it was not annotated in 30 out of 99 strains, including the reference genome Wuhan-Hu-1 (MN908947).

To determine the nucleotide mutation patterns over time at each viral region, each aligned sequence was grouped into one of six time ranges, and the time range within each group was determined as the number of days passed since the reference strain (Wuhan-Hu-1, 2019-12-31). Note that the number of days between time intervals is unequal, because strains were grouped based on roughly equal sample size and not by equal number of days. Then, sequences within time groups were pair-wise assessed for single nucleotide polymorphisms (SNPs) using DAMBE's "Seq. Analysis|Nucleotide substitution pattern" with reference genome = Wuhan-Hu-1 (MN908947) and Default genetic distance = F84, and the sum of SNPs within each group was calculated (see S4 File).

To control for any confounding effects imposed by mutations that could arise in specific geographic areas, we repeated the above analysis for all high quality and complete genomes in a country-specific manner. Only three countries have sequenced large numbers of strains with unique collection dates, leading us to consider 80 strains from the United States, 39 strains from Australia, and 34 strains from China (see S4 File). Note that because sample collection dates vary from one country to another, the time intervals will differ among geographical locations. In addition, within a geographical location, the sample sizes and time intervals may differ slightly among viral regions because not all strains have complete annotations for every viral region. For example, all 34 strains from China have an annotated E region, but two out of the 34 strains are missing an annotation for ORF8. Nucleotide mutations in these strains were traced relative to the reference genome being the oldest available strain in each country: MN908947 in China (2019-12-31), MN985325 in the US (2020-01- 19) , and MT450920 in Australia (2020-01-25). The statistical significance of C to U mutations relative to all other mutations was established using the non-parametric Wilcoxon rank-sum test with continuity correction.

The count, location, and identity of all non-synonymous SNPs were determined for each MAFFT aligned protein coding region (e.g., ORF1ab, see S6 File) using DAMBE's "Seq. analy-sis|Codon substitution pattern, reference = Wuhan-Hu-1, MN908947". Next, the count, identity, and location of all SNPs at each viral region were determined using DAMBE's "Seq. analysis|Site-specific Nuc. Freq.". This output was then compared with the output that contains non-synonymous substitutions to obtain the count, identity, and location of all synonymous substitutions. Similarly, the count, identity, and location of all non-coding SNPs in the two non-protein coding regions (5' UTR and 3' UTR) were determined using DAMBE's "Seq. analysis|Site-specific Nuc. Freq.".

The above outputs were further processed to determine the unique locations to obtain sitespecific C to U mutations in each viral region. These outputs were used to determine the identities of the flanking nucleotides for all site-specific C to U mutations to generate the 5' NC, 5' NNC, and 5' NNCN motif contexts (underlined are the C to U mutation sites, and N is any nucleotide). Next, the total numbers of 5' NC, 5' NNC, and 5' NNCN motifs in the Wuhan-Hu-1 genome were determined using DAMBE's "Sequences|Extract motif context". Finally, these values were used to calculate the odds-ratio for each motif: the observed proportion of motifs with C to U mutations (e.g., number of 5' AC with C to U mutations divided by total number of 5' AC dinucleotides in Whuan-Hu-1 genome = 36/2023) divided by the expected proportion of C to U mutation (total number of C to U mutations divided by total number of C in Wuhan-Hu-1 genome = 98/5492). For example, the odds-ratio of 5' AC is (36/2023)/(98/ 5492) = 0.997.

Next, for putatively edited C containing substrates on the Wuhan-Hu-1 genome, a 5 nt motif NNCNN was extended by 8 nt on either side to obtain a 21 nt sequence. To obtain the folding energy of the 21 nt sequence and obtain the secondary structure of the 5'NNCN motif, we used Minimum Folding Energy (MFE,-kcal/mol) via the Vienna RNA Folding Library [78] , with the following options: no lonely pairs, Temperature = 37˚C (S6 File).

We determined which human tissues are regularly infected by coronaviruses and whether these tissues express ZAP and APOBEC3 in abundance. S1 Fig shows the tissue-specific mRNA expressions for AVP isoforms in humans and the number of tissue infection records for SARS-CoV-2, SARS-CoV, and MERS. For each susceptible tissue, Fig 1 shows the relative mRNA expressions (in PME, Eq 1) of AVPs determined as high (in green) or low (in red) (see Materials and Methods for validation of PME). Furthermore, the regular habitats of each coronavirus were determined based on the highest COD (Eq 2, See Materials and methods for determination of regularly infected tissues). The lungs and the intestines are regular habitats of SARS-CoV-2 and both tissues contain high PMEs for many APOBEC3 isoforms (Fig 1: A3A , A3B, A3D, A3G, A3H in the lungs, and A3B, A3D, A3G, and A3H in the intestines) and for ZC3HAV1. Similarly, the regular habitats of SARS-CoV (lungs) and MERS (lungs) also contain high PMEs for some APOBEC3 and ZAP isoforms (Fig 1) . Therefore, all three surveyed human coronaviruses can regularly infect host tissues where both ZAP and APOBEC3 mRNAs are expressed in abundance and they display no strong preference for tissues deficient in either ZAP or APOBEC3 transcripts.

In an approach similar to Koning et al. , showing that APOBEC3 mRNAs are abundant in the lung relative to other non-lymphoid tissues. Tissues such as the brain, liver, heart, skeletal muscle, and kidney are all deficient in both ZAP and APOBEC3 mRNAs. The stomach, pancreas and testes abundantly express a subset of APOBEC3 enzymes but are ZAP deficient. In contrast, tissues of lymphoid organs including the spleen, adrenal gland, and thyroid express both AVPs in abundance.

Retrieving averaged mRNA expression levels of ZAP and APOBEC3 in four other mammalian species (cattle, dog, pig, mice) and their tissue-specific records of coronavirus infection (Figs 2 and S2) reveals the tissues most susceptible to infection for these species (by highest CODs, tissues shaded in blue-gray), as well as the relative mRNA expressions (PMEs) for AVP isoforms in these tissues. Like human coronaviruses, these mammalian coronaviruses also regularly infect tissues exhibiting both high APOBEC3 and ZAP mRNA expressions. Examples include HEV infecting pig liver, CRCoV infecting dog intestines and lungs, and Bovine CoV infecting cattle intestines. Conversely, while MHV regularly infects the brain in mice (Fig 2) , PMEs for both APOBEC3 and ZAP are low in this tissue.

Taken together, Figs 1 and 2 show that lungs and intestines are regularly infected by five out of seven surveyed coronaviruses and exhibit high abundances of AVPs in all five mammals, except for lungs in cattle. This suggests that tissue-specific APOBEC3 and ZAP expressions may be correlated. Based on 24 human tissues, PMEs of APOBEC3 and ZAP are significantly positively correlated (e.g., for fitted regression line between 24 A3H and ZC3HAV1 values: coefficient of determination R 2 = 0.43, P < 0.001). Similarly, we found significant positive correlations between the PMEs of both AVPs in 17 mice tissues (APOBEC3 vs ZC3HAV1: R 2 = 0.49, P = 0.0017) and ten dog tissues (APOBEC3Z3 vs ZC3HAV1: R 2 = 0.56, P = 0.021). In contrast, there is no significant correlation between PMEs of both AVPs in 26 cattle tissues The lines show the relative mRNA expressions in PME, for each APOBEC3 and ZAP isoform, among tissues having records of SARS-CoV-2, SARS-CoV, and MERS infections. Dots highlighted in green and red are PME values that are greater and lower than the averaged PME values, respectively. These PME values were calculated based on averaged mRNA FPKMs retrieved from the GTEx Portal [61] . For each tissue, the commonness of viral detection (COD) score is appended in brackets next to tissue name. Shaded in light blue-gray are tissues that were determined to be regularly infected by the coronavirus (based on highest COD scores).

(APOBEC3H vs ZC3HAV1: R 2 = 0.22, P = 0.34) or 33 pig tissues (APOBEC3H vs ZC3HAV1: R 2 = 0.11, P = 0.065).

Differential transcriptomic analysis of uninfected and infected LoM lung epithelial cell isolates revealed that among ADAR, AID, ZAP, APOBEC1, and APOBEC3 paralogues, transcripts that were found to be significantly differentially expressed between uninfected and infected human lung endothelial cells encoded A3G, ADAR, and ZAP. In all cases, the time after infection (on the time scale considered) was less of a contributing factor to expression levels than the intrinsic presence of infection (Figs 3 and S3 ). This is evidenced by the consistent clustering of the uninfected control samples contrasted with the greater variance in the transcriptspecific clustering of TPMs within infection time points. These results generally support the notion that ADAR, A3G, and ZAP transcripts are either upregulated during SARS-CoV-2 infection relative to uninfected lung epithelial cells or remain at similar levels. Pig, dog, and cattle tissues that are regularly infected by their respective coronaviruses (HEV, CRCoV, Bovine CoV) have high AVP mRNA expressions, but the mice brain that is regularly infected by MHV does not have high AVP mRNA expressions. The lines show the relative mRNA expressions (PME), for each APOBEC3 and ZAP isoform, among tissues having records of viral infections. Dots highlighted in green and red are PME values that are greater and lower than the averaged PME values, respectively. These PME values were calculated based on averaged mRNA expressions retrieved from the Bovine Genome Database [66] , BioProject PRJNA124245 [65] , TISSUE 2.0 integrated datasets [64] , mouse ENCODE consortium [62] and BioProject PRJNA516470 [63] . For each tissue, the commonness of viral detection (COD) score is appended in brackets next to tissue name. Shaded in light blue-gray are tissues that are regularly infected by the virus (based on highest COD scores).

https://doi.org/10.1371/journal.pone.0244025.g002

In particular, the results we observed for A3G are consistent with those observed during influenza A infection of A549 lung epithelial cells by Pauli et al.

[24] insofar as A3G was upregulated to the exclusion of other APOBEC3 paralogues and a corresponding significant upregulation of IFNβ-encoding transcripts was generally observed in tandem, with the strongest coupling occurring in the samples with the greatest TPM fold-change (Fig 3) . Apart from A3G, the sample from the SRX8839376 experiment demonstrated especially high upregulation of all transcripts of interest, followed by SRX8839375 and SRX8839379. All of these are from earlier infection time points (2 and 6 days following infection), suggesting that sharper expression profile changes tend to happen earlier in infection. In contrast, the data make it clear that the cellular response to SARS-CoV-2 infection varies quite substantially. While the control samples have the most closely related expression profiles, the infected samples varied far more widely in their expression profiles and did not cluster strongly in a time-dependent manner. This emphasizes that immunological differences between individuals likely play an important role in combatting infection.

Upon comparing the CpG and U contents of coronaviruses, we found those that regularly infected AVP-rich tissues tend to exhibit diminished CpG content in tandem with elevated U content. Conversely, MHV neither targeted AVP-rich tissues, nor did its genome indicate directional mutation with respect to CpG or U content. In both trimmed genomes (Fig 4A) and whole genomes (S4A Fig), MHV had the highest I CpG (about 0.6 or higher) while SARS--CoV-2 had the lowest I CpG (below 0.43 in all but two strains). As for all other coronaviruses surveyed, they also exhibited low I CpG < 0.5 except for MERS being slightly higher. It should be noted that among the seven coronaviruses, I CpG values also showed the greatest variation among MHV genomes but are much more constrained among the other six genomes (S5A Fig). Nonetheless, in all seven coronaviruses, median I CpG is the most deficient among I XpY calculated (where X and Y are A, C, G, or U) and no other dinucleotides display strong deficiency or surplus (S6 Fig).  Fig 4 panels b , c, and d show that the proportion of U nucleotides (P U ) is inverse to the proportion of C nucleotides (P C ), but P U does not correlate with P A or P G . Bovine CoV, CRCoV, and HEV all have very high P U and conversely very low P C . In comparison, MHV does not regularly infect tissues highly expressing APOBEC3 and has relatively reduced P U and increased P C (Fig 4B) . Similar to I CpG , P U was least constrained in MHV relative to any other coronavirus (S5B Fig). Among human coronaviruses, genomic P U is low in SARS-CoV-2 and MERS and especially in SARS-CoV (Fig 4B) . These patterns persisted when I CpG and P U were re-analyzed using whole, untrimmed, genomes (S4B Fig) .

Our above results demonstrate that viral genomes exhibit pronounced shifts towards CpG deficiency and elevated U content when the virus regularly infects tissues with high expression of both AVPs. However, human viruses share similar or lower global P U relative to MHV, which predominantly infects AVP-deficient mice tissues (Fig 4) . To better understand the distribution of U content, we examined whether there has been a history of P U elevation in local 

SARS-CoV-2 regions over the span of the first four months since the virus was first isolated. We observed that most SNPs are C to U mutations (Fig 5A) , and these mutations are prevalent at the 5' UTR and ORF1ab regions but infrequent at other viral regions (Fig 5B) .

We next assessed temporal SNP patterns in each of 12 SARS-CoV-2 regions (excluding ORF7b) from a sample of 99 SARS-CoV-2 strains (see Materials and Methods). We observed a striking number of C to U mutations in aligned sequences between the reference and sampled strains (Fig 6) , and the total number of C to U mutations trends upward over time in the 5' UTR and ORF1ab regions, but other regions did not exhibit any clear C to U mutation patterns. Other notable substitution patterns were observed in the S region and ORF3a regions, namely: A to G mutations and G to U mutations, respectively.

To control for potential geographical bias, such as a widespread C to U hypermutation before SARS-CoV-2 was transmitted outside of China, we show that the prevalence of C to U mutations in 5' UTR and ORF1ab regions persists when considering the temporal and geographical SNP patterns of SARS-CoV-2 strains with unique sample dates are isolated from three different countries: United States, Australia, and China (Figs 7 and S7-S9) . In all three countries, aligned sequences revealed the same pattern of C to U mutations we previously observed. Likewise, C to U mutations trended upwards over time in the 5' UTR and ORF1ab (Fig 7) , but this bias was absent in other regions (S7-S9 Figs). The number of C to U mutations is significantly greater than for any other mutations in the ORF1ab region regardless of geographical constraint (all ORF1ab panels in Figs 5 and 6, Wilcoxon rank sum test with continuity correction: P < 0.01). Indeed, only C to U mutations were observed in the 5' UTR region of strains collected from the United States and Australia (Fig 7) , but no two countries shared the same SNP patterns in other viral regions (S7-S9 Figs).

We next investigated the sequence context for the prevalently observed C to U mutations. A total of 477 SNPs were observed comparing the reference genome (Wuhan-Hu-1) to a sample of 98 SARS-CoV-2 strains. Over half of these SNPs were C to U mutations (262/477, S1 Table) , and 144 and 82 C to U mutations were synonymous and non-synonymous substitutions, respectively. These 262 C to U mutations were found at 98 unique nucleotide sites with respect to the reference genome of Wuhan-Hu-1 (S10 Fig), with 92 unique sites located within protein coding regions, 55 of which accounted for the synonymous substitutions and 37 sites were associated with non-synonymous substitutions. The remaining six unique mutation sites were not located within protein coding regions. Furthermore, the locations of unique sites subjected to C to U mutations were roughly evenly distributed across the Wuhan-Hu-1 genome (S10A Fig); these SNPs were not densely packed at any specific sequence region (S10B Fig). 

Many of the aforementioned 98 unique C to U mutations sites occur at 5' CC and 5' UC (with C to U mutation sites underlined) dinucleotides embedded in motifs that facilitate APOBEC3 binding and editing in HIV-1, MLV, and SIV (Table 2 ). For each dinucleotide and 

trinucleotide motif, odds-ratios were calculated using the observed proportion of C to U mutations divided by the expected proportion of C to U mutations (see Materials and Methods). Specifically, the preferred dinucleotides are 5' CC and 5' UC, and the preferred trinucleotides are 5' ACC, 5' CCC, and 5' UUC (by highest odds-ratio > 1 in bold).

Underlined are sites subjected to C to U mutations. In red are non-APOBEC3 deaminases that were reported to have C to U/T editing ability in non-viral sequences. 1 [79] ;

Among dinucleotides, only 5'CC and 5' UC have odds-ratios > 1 (observed > expected) with 1.136 and 1.111, respectively. In viruses such as HIV, MLV, and SIV, most studies are consistent in demonstrating that A3G prefers to edit 5' CC whereas the other six APOBEC3 enzymes prefer to edit 5' TC (Table 2 ). In the context of 5' NCC, the two most preferred trinucleotide motifs are 5' ACC (oddsratio = 1.639) and 5' CCC (odds-ratio = 1.449). This observation is consistent with multiple studies showing that 5' CCC is preferred [41, [79] [80] [81] , although others found that 5' RCC may also be preferred [51] by A3G editing in HIV-1 and MLV. When 5' NUC is considered, the preferred trinucleotide in SARS-CoV-2 is 5' UUC (odds-ratio = 1.731). This observation is also corroborated by multiple studies indicating that 5' TTC is preferred by all APOBEC3 enzymes except A3G [45-47].

We further considered C to U mutations in the context of 5' NCCN and 5' NTCN. All studies summarized in Table 3 conclude that both the -2 and +1 positions flanking 5' NC influence the efficacy of APOBEC3 editing. Comparing between reported APOBEC3 enzyme activities by independent studies, activity levels were classified as preferred (++), less preferred (+), inefficient (-), and avoided (-) among motifs examined. A3D was excluded from Table 3 because its consensus target could not be specified beyond 5' (T/A)(T/A)C(G/T) [7] ; an A3D-preferred motif has not been established [52] because the catalytic properties of this enzyme are not fully characterized [90] . Despite the lack of a strongly preferred consensus sequence among many APOBEC3 enzymes, most studies are consistent in reporting 5' CCC(A/T) as preferred targets of A3G, and 5' TTC(A/T) are among, if not the most, preferred motifs by all 5' TC editing APOBEC3 enzymes except A3B (Table 3) .

In SARS-CoV-2, the two tetranucleotides embedding the 5' UC editing target with the highest number of unique C to U mutations were 5' UUCA and 5' UUCU (odds-ratios 1.937 and 1.399, respectively), followed by 5' GUCA, 5' CUCA, and 5'UUCG (odds-ratios 1.542, 1.153, 3.874, respectively), and all except 5'UUCG are preferred APOBEC3 editing motifs (Table 3) . However, the A3G-preferred 5' CC motifs 5' CCC(A/U) (e,g. [91] ) were not found and 5' ACC (A/U) were instead abundant in SARS-CoV-2 (Table 3) Table 3: 16 out of 23, see secondary structure details in S6 File). Of 5' (C/U)C targets in 5' N(C/U)CG that were reported to prefer the loop region by Sharma and Baysal [56], three 5' NUCG motifs were observed in SARS-CoV-2 (all being 5' UUCG), two were found in the loop region and one was found in the stem region; in contrast, only one 5' NCCG (5' ACCG) motif was observed and it had an open structure (Table 3) .

Lastly, we performed a temporal analysis to determine whether there are differences in I CpG within and between viral regions among the 99 SARS-CoV-2 strains. Within viral regions, there were no notable differences in I CpG between strains sampled at different time intervals (Fig 8) . However, there were notable differences in I CpG between viral regions. In particular, the ORF1ab, S, and ORF6 regions had the lowest I CpG values < 1, whereas the 5' UTR, E, and ORF10 regions had the highest I CpG values > 1. Thus, CpG content varies substantially across the SARS-CoV-2 genome [92, 93] .

Edited motifs The seven APOBEC3 members were abbreviated in the table to show only the last letter of the enzyme (e.g., A3A = A). b -The relative APOBEC3 activity levels at surveyed motifs were designated with + and-symbols: "++" = preferred, "+" = less preferred, "-" = inefficient, "--" = avoided.

The "na" indicates that data is unavailable. Activity levels reported by different studies are separated by a "/" symbol, and the order of activity data corresponds to the order of cited references shown in the table header.

https://doi.org/10.1371/journal.pone.0244025.t003

Next, we examined the CpG content in the SARS-CoV-2 genome in the context of CN X GNCG motifs that were preferably recognized by ZAP in mice [16] . When the reference Wuhan-Hu-1 was compared to the other 98 strains sequenced in the following four months, there were only 11 unique sites where either C or G, in the context of CpG, had been mutated, and only two out of the 11 mutations occurred in the context of CN X GNCG. In addition, other C or G mutations, in the context of CpG, do not particularly prefer CG-rich sequences (S6 File). Note that we determined the number of mutations that have occurred at unique sites when referenced to the genome of Wuhan-Hu-1, because while a mutation at a given site may be carried by multiple later strains, the creation of such mutations could be derived from singular events. Nevertheless, there was a deficit in the total number of observed CG dinucleotides (Table 4 : Obs/Exp ratio = 0.408) and CN X GNCG motifs (Table 4 : Obs/Exp ratio ranges from 0.309 to 0.619) at the genome of Wuhan-Hu-1.

SARS-CoV-2 poses a serious global health emergency. Since its outset in Wuhan City, Hubei province of China in December 2019, the viral outbreak has resulted in over 20 million confirmed cases around the world (https://www.who.int/emergencies/diseases/novelcoronavirus-2019, last accessed August 12, 2020). The pandemic has prompted an immediate global effort to sequence the genome of SARS-CoV-2, and by May 2020 over 28000 strains have been publicly deposited over the course of just four months. With a wealth of sequence data, we performed a comprehensive comparative genome study on SARS-CoV-2 and six other coronaviruses across five mammalian species, with the aim to understand how coronaviruses evolve in response to tissue-specific host immune systems.

We tested whether APOBEC3 and ZAP immune responses act as selective pressures to shape the genome of an infecting coronavirus. We note that ZAP is highly expressed in human lungs (Fig 1) and we observed that its expression is further upregulated in SARS-CoV-2 infected lung epithelial cells relative to the control (Fig 3) . Our observations are compatible with the notion that cytoplasmic ZAP can bind to CpG dinucleotides to facilitate the degradation of viral transcripts. This idea, in conjunction with our observations, is corroborated by a recent study that found ZAP targets CpG to restrict SARS-CoV-2 replication in human lung cells [17] . In contrast to ZAP, APOBEC3 enzymes are mostly expressed in immune cells such as CD4 + T cells residing in tissues [22] . SARS-CoV-2 infection triggers T cell response in infected patients [94] , and the ability of CD4 + T cells to recognize a virus would then allow APOBEC3 enzymes to be packaged into the virions and cause RNA-editing [20] .

We predicted that viral genomes should be driven towards reduced CpG dinucleotides to elude ZAP-mediated cellular antiviral defense, and increased U residues because of RNA editing by APOBEC3 proteins. In line with our expectations, we found compelling hallmarks of CpG deficiency as well as elevated U with lowered C contents in the genomes of SARS-CoV-2, SARS-COV, MERS, Bovine CoV, CRCoV, and HEV that regularly infected mammalian tissues expressing both AVPs in abundance (Fig 4) . Unsurprisingly, these sequence trends were absent from MHV genomes (Fig 4) as this virus regularly infects mice tissues that lowly express AVPs (Fig 2) . Corroborating this observation, both I CpG and P U values showed the greatest variation among MHV strains (S5 Fig), suggesting that MHV genomes are not constrained by either AVP. These results suggest that when a virus regularly infects host tissues that are abundant in ZAP and APOBEC3, these AVPs shape the molecular evolution of viral genomes in two ways: CpG deficiency allows the virus to evade ZAP-mediated antiviral defense, and elevated U content due to APOBEC3 editing activity.

Among three human coronaviruses, SARS-CoV-2 genomes exhibit the most CpG deficiency (Fig 4A) . Many recent studies point to Bat CoV RaTG13 as the most closely related known relative of SARS-CoV-2 when the whole genome is considered [95] [96] [97] [98] , and to the bat Number of unique mutations were determined on the Wuhan-Hu-1 genome when it is compared to 98 later strains, in the context of CN X GNCG (underlined are either C or G nucleotides that were mutated). N X indicates the spacer sequence of length x = 4 nt to 8 nt. The "Observed number" indicates the total number of nucleotide and motifs observed in the genome of Wuhan-Hu-1, the "Expected number" is calculated based on the total nucleotide frequencies and the length of Wuhan-Hu-1 genome, and "Obs/Exp ratio" calculates Observed number/Expected number. The specific sequence contexts of all 11 unique C or G mutations that occurred at CpG are shown in S6 File.

https://doi.org/10.1371/journal.pone.0244025.t004

Rhinolophus affinis as a potential intermediate host or reservoir for SARS-CoV-2 [99] . Indeed, the I CpG values in SARS-CoV-2 are comparable to that of Bat CoV RaTG13 infecting Rhinolophus affinis but lower than that of many other coronaviruses surveyed [18] . Nevertheless, global CpG is not more deficient in SARS-CoV-2 than many other highly pathogenic coronaviruses [92] . Despite this, CpG deficiency largely fluctuates in local coding regions [92, 93, 100] ; the S, ORF1ab, and ORF6 regions have the most severe CpG deficiencies (Fig 8, I CpG < 0.4) , whereas the 5' UTR, E, and ORF10 regions have CpG surplus with no signs of CpG deficiency (Fig 8, I CpG > 1) . This may be surprising since one would expect that maintaining high CpG, regardless of its location, should have a detrimental effect on the virus. However, ZAP-mediated RNA degradation is cumulative [7] . When CpG dinucleotides are added to individual viral segment 1 or 2 in HIV-1, the inhibitory effect of ZAP is weak, but when the same CpG dinucleotides are added to both segments 1 and 2, the ZAP inhibition effect is strong. This implies that longer RNA sequences (ORF1ab and S) are more likely to be targeted by ZAP.

Moreover, a study on early SARS-CoV-2 genome evolution [92] suggests that the CG-rich N region is biased towards mutations lowering CpG content, whereas the CpG levels remain consistently low in the S region. Nonetheless, we found no notable change in I CpG between 99 SARS-CoV-2 strains sampled in four months since its first detection (Fig 8) , and occurrences of unique mutations at the CG dinucleotides in the context of the CG-rich CN X GNCG motifs known to be preferred for ZAP targeting in mice [16] were rare (Table 4 ). This suggests that the evolutionary adaptation to CpG deficiency had not been a rapid process. Despite this, the first isolated SARS-CoV-2 genome is deficient in both CG dinucleotides and CN X GNCG motifs (Table 4) . Hence, SARS-CoV-2 may have preadapted to a low-CpG human environment as its closest RaTG13 counterpart is likewise CpG deficient [17] . Altogether, our results are consistent with recent studies suggesting that, similar to the HIV-1 genome [7] , the SARS--CoV-2 genome appears to be CpG deficient to evade ZAP recognition [17] .

While APOBEC3 enzymes are highly expressed in immune cells [21, 22] , they are also detected in mammary and lung epithelial cells [23, 24 ]. An analysis of total RNA at the tissue level (Fig 1) showed that APOBEC3 enzymes are highly expressed in healthy lung tissues in comparison to other non-lymphoid tissues; this is consistent with results reported by Koning et al. [21] and Refsland et al. [22] . Indeed, expressions of APOBEC3 enzymes are not confined to immune organs but are dependent on tissue lymphocyte contents. To further localize antiviral protein expression, we examined the transcriptomic data of human lung epithelial cells in the presence and absence of SARS-CoV-2 infection (Fig 3) . Our findings were consistent with Pauli et al.

[24] insofar as we observed selective upregulation of A3G to the exclusion of other APOBEC3 paralogues in the lung epithelial cells during viral infection. This suggests that tissue-residing immune cells are predominantly contributing to the APOBEC3 levels observed in tissue total RNA (Fig 1) , particularly with respect to the high abundance of A3A in lung tissue. It remains likely that tissue-residing immune cells are primarily responsible for variable APO-BEC3 expressions at the tissue level.

A survey of complete SARS-CoV-2 genomes did not indicate drastically increased U and decreased C contents (Fig 4B) . Nonetheless, over the span of four months since the virus was first isolated, there has been a history of P U elevation and strong bias for C to U mutations relative to other substitutions. These C to U mutations are mostly located in the 5' UTR and ORF1ab regions (Figs 5-7) , accounting for over half of all SNPs in SARS-CoV-2. That we observed the same prevalence of C to U mutations in the 5' UTR and ORF1ab regions in strains collected from three different countries (Fig 7) suggests that geographic patterns of sampling were not a confounding factor. Indeed, these results suggest that SARS-CoV-2 is consistently biased towards C to U mutations.

Consensus motifs embedding C to U mutations that are acted on by APOBEC3 enzymes have been experimentally verified in HIV-1. To support the hypothesis that C to U mutations in SARS-CoV-2 are catalyzed by APOBEC3 enzymes, we determined that the preferred C to U mutation hotspots in SARS-CoV-2 are the same as those in HIV-1. As summarized in Table 2 , most studies have shown that 5' CC and 5' CCC (with C to U mutation sites underlined) are the preferred consensus motifs subjected to RNA editing by A3G in HIV-1 and MLV. As for other APOBEC3 enzymes, 5' TC and 5' TTC are the preferred consensus motifs that are subjected to RNA editing in HIV-1, MLV, and SIV. Similarly, C to U mutations are prevalent in the aforementioned sequence contexts in SARS-CoV-2, suggesting that APOBEC3 enzymes may indeed edit the SARS-CoV-2 genome. Furthermore, among 5' N(C/U)CN mutations in SARS-CoV-2, the APOBEC3-preferred 5' UUC(A/U) were the most commonly observed (Table 3) (Table 3) were not found and 5' NCCG motifs did not prefer the loop region as shown by Sharma and Baysal [56] . As APOBEC3 enzymes can be efficiently co-packaged into the same viral particle [50], these results suggest that while all A3A, A3B, A3C, A3F, and A3H could contribute to the prevalence of 5' UC to 5' UU mutations, the effect of A3G weakly explains 5' CC to 5' CU mutations in SARS--CoV-2. This is consistent with the observation made by Pauli et al. [24] regarding A3G exhibiting no antiviral efficacy during influenza A infection despite its upregulation in that context.

It is not excluded that other deaminase enzymes may contribute to RNA editing in SARS--CoV-2. The preferred consensus motif that is subjected to editing by AID is 5' UAC in the rpoB gene in E. coli [81] ; this may explain why C to U mutations were also preferred at 5' UAC in SARS-CoV-2 (Table 2 ). In addition, 5' TC is preferentially edited by APOBEC1 in chicken Bcells [88] . Nonetheless, it is unknown whether these enzymes possess the ability to target viral genomes [81, 88] . Another noteworthy observation is that A to G mutation was preferred in the S region and the numbers of A to G mutations in this specific region were increasing over time (Fig 6) . This mutation may be caused by the ADAR enzyme [34, 38] , which edits A into I and subsequently into G, in viruses that infect the lungs such as Influenza virus A and Measles virus [101, 102] . Although ADAR was known for targeting double-stranded RNAs and not singlestranded RNAs [35, [103] [104] [105] , the secondary structure of viral genomes often contains local regions of base-pairings as possible substrate for ADAR. A survey of APOBEC1, AID, and ADAR expressions in 27 human tissues (S11 Fig) shows that APOBEC1 and AID, but not ADAR, are most expressed in the small intestines. None are highly expressed in the lungs, but ADAR mRNA expression was upregulated in SARS-CoV-2 infected lung epithelial cells in comparison to the control (Fig 3 and S3 Fig) . Therefore, in addition to APOBEC3, other host deaminase systems such as A1, AID, and ADAR may act to edit the genome of SARS-CoV-2; however, both AID and A1 deaminases are DNA mutators that are not known to target viruses [81, 88] .

The current study focuses on the evolutionary pressure that host immune systems exert onto viral genomes. Our aim is to prompt motivations for vaccine designs in the development of attenuated RNA viruses. Previous experimental works have shown that increasing CpG dinucleotides in CpG-deficient viral genomes drastically decrease viral replication and virulence [10, [106] [107] [108] [109] [110] , and in recent years several studies have proposed vaccine development strategies involving increased CpG to attenuate RNA viruses [5, 10, 107, 109] . Increasing CpG content may provide a good starting point for strategies to attenuate SARS-CoV-2. On the other hand, because C to U deamination cannot be proof-read by viral exonuclease Nsp14-ExoN [36, 111, 112] , host innate deaminases may drive up the rate of evolution in viral genomes [34, 49] . The possibility of APOBEC3 editing activity and its potential influence on the pathogenesis and drug resistance of viruses such as SARS-CoV-2 in the long term requires further investigation and scrutiny.

The color spectrum from blue (higher) to white (median) to red (lower) indicates the comparative tissue-specific mRNA expressions of seven APOBEC3 isoforms (A3A, A3B, A3C, and A3D, A3F, A3G, A3H) and two ZAP isoforms (ZC3HAV1 and ZC3HAV1L) within each independent study. Similarly, for tissue habitats of the viruses, the color spectrum (from more blue to white) indicates the prevalence of tissue infection observed from independent studies (observed most commonly to least commonly across different studies, respectively). Light grey indicates tissues with no expression data or for which we encountered no peer-reviewed reports of infection. Table. The summed number of C to U mutations and other mutations (non-C to U) that were observed at each viral region when the reference SARS-CoV-2 strain Wuhan-Hu-1 (accession MN908947) was compared to 98 SARS-CoV-2 genomes collected worldwide with unique collection dates. (PDF) S1 File. File S1 is the dataset containing reference compilation of virus regular habitats, tissue total RNA AVP mRNA expressions, and LoM AVP mRNA expressions. 

Pathological evidence for residual SARS-CoV-2 in pulmonary tissues of a ready-for-discharge patient

SARS-CoV-2 productively infects human gut enterocytes

Analysis of the Human Tissue-specific Expression by Genome-wide Integration of Transcriptomics and Antibodybased Proteomics. Molecular & Cellular Proteomics

Structure of the zinc-finger antiviral protein in complex with RNA reveals a mechanism for selective targeting of CG-rich viral sequences

CpG Dinucleotides Inhibit HIV-1 Replication through Zinc Finger Antiviral Protein (ZAP)-Dependent and -Independent Mechanisms

The zinc-finger antiviral protein recruits the RNA processing exosome to degrade the target mRNA

CG dinucleotide suppression enables antiviral defence targeting non-self RNA

Zinc-finger antiviral protein inhibits HIV-1 infection by selectively targeting multiply spliced viral mRNAs for degradation

The role of ZAP and OAS3/RNAseL pathways in the attenuation of an RNA virus with elevated frequencies of CpG and UpA dinucleotides

CpG-Recoding in Zika Virus Genome Causes Host-Age-Dependent Attenuation of Infection With Protection Against Lethal Heterologous Challenge in Mice

Patterns of evolution and host gene mimicry in influenza and other RNA viruses

Within-patient mutation frequencies reveal fitness costs of CpG dinucleotides and drastic amino acid changes in HIV

The influence of CpG and UpA dinucleotide frequencies on RNA virus replication and characterization of the innate cellular pathways underlying virus attenuation and enhanced replication

Patterns of oligonucleotide sequences in viral and host cell RNA identify mediators of the host innate immune system

Relationship of SARS-CoV to other pathogenic RNA viruses explored by tetranucleotide usage profiling

SARS-CoV-2 Is Restricted by Zinc Finger Antiviral Protein despite Preadaptation to the Low-CpG Environment in Humans

Extreme Genomic CpG Deficiency in SARS-CoV-2 and Evasion of Host Antiviral Defense. Molecular biology and evolution

Role and Mechanism of Action of the APOBEC3 Family of Antiretroviral Resistance Factors

APOBECs and virus restriction

Genome-wide identification of zero nucleotide recursive splicing in Drosophila

The Genotype-Tissue Expression (GTEx) project

A comparative encyclopedia of DNA elements in the mouse genome

Conservation, acquisition, and functional impact of sex-biased gene expression in mammals

TISSUES 2.0: an integrative web resource on mammalian tissue expression. Database

A compendium of canine normal tissue gene expression

Bovine Genome Database: new annotation tools for a new reference genome

Proteomics. Tissuebased map of the human proteome

Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics

Near-optimal probabilistic RNA-seq quantification

The Lair: a resource for exploratory analysis of published RNA-Seq data

Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Gene-level differential analysis at transcript-level resolution

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

DAMBE7: New and Improved Tools for Data Analysis in Molecular Biology and Evolution

Pervasive CpG suppression in animal mitochondrial genomes

Compositional biases of bacterial genomes and evolutionary implications

Vienna RNA secondary structure server

Single-strand specificity of APOBEC3G accounts for minus-strand deamination of the HIV genome

A second human antiretroviral factor, APOBEC3F, is suppressed by the HIV-1 and HIV-2 Vif proteins

A portable hot spot recognition loop transfers sequence preferences from APOBEC family members to activation-induced cytidine deaminase

Efficient deamination of 5-methylcytidine and 5-substituted cytidine residues in DNA by human APOBEC3A cytidine deaminase

Cytidine deaminase efficiency of the lentiviral viral restriction factor APOBEC3C correlates with dimerization

A3A chimera inhibits HIV replication

Human APOBEC3B is a potent inhibitor of HIV-1 infectivity and is resistant to HIV-1 Vif

Polymorphisms and splice variants influence the antiretroviral activity of human APOBEC3H

HIV-1 Vif adaptation to human APOBEC3H haplotypes

The RNA editing enzyme APOBEC1 induces somatic mutations and a compatible mutational signature is present in esophageal adenocarcinomas

Sequence and structural determinants of human APOBEC3H deaminase and anti-HIV-1 activities

Biochemical Basis of APOBEC3 Deoxycytidine Deaminase Activity on Diverse DNA Substrates

Conserved footprints of APOBEC3G on Hypermutated human immunodeficiency virus type 1 and human endogenous retrovirus HERV-K(HML2) sequences

The heterogeneous landscape and early evolution of pathogen-associated CpG and UpA dinucleotides in SARS-CoV-2

Intra-genome variability in the dinucleotide composition of SARS-CoV-2

SARS-CoV-2-specific T cell immunity in cases of COVID-19 and SARS, and uninfected controls

The proximal origin of SARS-CoV-2

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and coronavirus disease-2019 (COVID-19): The epidemic and the challenges

Structural basis of receptor recognition by SARS-CoV-2

On the origin and continuing evolution of SARS-CoV-2. National Science Review

Are pangolins the intermediate host of the 2019 novel coronavirus (SARS-CoV-2)?

The Architecture of SARS-CoV-2 Transcriptome

Doublestranded RNA adenosine deaminase ADAR-1-induced hypermutated genomes among inactivated seasonal influenza and live attenuated measles virus vaccines

RNA editing enzyme adenosine deaminase is a restriction factor for controlling measles virus replication that also is required for embryogenesis

A-to-I RNA editing-immune protector and transcriptome diversifier

The Epitranscriptome and Innate Immunity

Moderate mutation rate in the SARS coronavirus genome and its implications

Increasing the CpG dinucleotide abundance in the HIV-1 genomic RNA inhibits viral replication

Genetic Inactivation of Poliovirus Infectivity by Increasing the Frequencies of CpG and UpA Dinucleotides within and across Synonymous Capsid Region Codons

CpG and UpA dinucleotides in both coding and non-coding regions of echovirus 7 inhibit replication initiation post-entry. eLife

RNA virus attenuation by codon pair deoptimisation is an artefact of increases in CpG/UpA dinucleotide frequencies. eLife

The CpG dinucleotide content of the HIV-1 envelope gene may predict disease progression

Infidelity of SARS-CoV Nsp14-exonuclease mutant virus replication is revealed by complete genome sequencing

Coronaviruses lacking exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and potential therapeutics

We thank Alibek Kruglikov and Heba Farookhi for discussion.

Conceptualization: Yulong Wei, Jordan R. Silke, Xuhua Xia.