key: cord-0007101-f4rx64zj
authors: Lošdorfer Božič, Anže; Podgornik, Rudolf
title: Varieties of charge distributions in coat proteins of ssRNA+ viruses
date: 2018-01-17
journal: J Phys Condens Matter
DOI: 10.1088/1361-648x/aa9ded
sha: 2f1a96ffd2bd86f2664e02b7dc63e462ba14dab5
doc_id: 7101
cord_uid: f4rx64zj

A major part of the interactions involved in the assembly and stability of icosahedral, positive-sense single-stranded RNA (ssRNA+) viruses is electrostatic in nature, as can be inferred from the strong pH- and salt-dependence of their assembly phase diagrams. Electrostatic interactions do not act only between the capsid coat proteins (CPs), but just as often provide a significant contribution to the interactions of the CPs with the genomic RNA, mediated to a large extent by positively charged, flexible N-terminal tails of the CPs. In this work, we provide two clear and complementary definitions of an N-terminal tail of a protein, and use them to extract the tail sequences of a large number of CPs of ssRNA+ viruses. We examine the pH-dependent interplay of charge on both tails and CPs alike, and show that—in contrast to the charge on the CPs—the net positive charge on the N-tails persists even to very basic pH values. In addition, we note a limit to the length of the wild-type genomes of those viruses which utilize positively charged tails, when compared to viruses without charged tails and similar capsid size. At the same time, we observe no clear connection between the charge on the N-tails and the genome lengths of the viruses included in our study.

The nature of viral genomes plays a significant role in their encapsidation into fully-formed virions. On the one hand, the double-stranded DNA (dsDNA) or RNA (dsRNA) genomes of numerous viruses need to be packaged into pre-formed capsids by a strong molecular motor, due to the high charge density and the rigid molecular conformation of double-stranded nucleic acids on the nanoscale [1-3]. On the other hand, viruses with more flexible single-stranded RNA (ssRNA) genomes tend to spontaneously self-assemble around the RNA filament, where the binding of the capsid coat proteins (CPs) can be guided by structure- or sequence-specific as well as non-specific interactions [1, 2, 4-7].

In general, the assembly of CPs around an RNA molecule can be a highly specific process, guided by, e.g., RNA packaging signals (PSs) [1, 5]. The presence of PSs alone does not, however, guarantee the packaging of RNA into virions. In addition to specific, localized structural features, RNA secondary and tertiary structure can be of importance for the interaction of the RNA with CPs prior to packaging [8-12]. What is more, the length of the RNA molecule itself is an important factor in the assembly, as it naturally carries a significant negative charge. The total charge of RNAs packaged into capsids of different ssRNA viruses is consistently greater than the positive charge of the basic amino acid (AA) residues lining the interiors of the capsids, making these viruses negatively overcharged [1, 2, 13]. This furthermore implies a direct relationship between the genome length and capsid charge, and in support of the importance of non-specific electrostatics driving RNA encapsidation, the total positive charge on the capsid inner surface was observed to correlate with the length of the genomic RNA for a diverse group of ssRNA viruses [1, 14-16].

The largest contribution to the non-specific electrostatic interactions between CPs and RNA is due to positively charged CP tail groups, whose affinity for RNA varies inversely with the ionic strength of the solution [1, 7, 17]. These tail groups are extended, highly flexible N-terminal arms and unstructured regions of varying lengths, and are present in the majority of non-enveloped spherical ssRNA+ viruses, while the CPs of enveloped viruses possess only the unstructured regions, with no extended N-terminal arms [4, 18]. CP tail groups do not simply play crucial structural roles, but are highly involved in a wide range of biological functions: their disordered nature is essential in promoting correct particle assembly and RNA encapsidation [19], they help in the switching of the CP conformation during assembly [4], and when these tail groups are rich in positively charged residues, they can also control the size of the assembled particles, while their removal can prevent native capsid assembly [5]. Interestingly, not all charged tail groups necessarily serve the same function: experimental evidence shows, for instance, that the tails of brome mosaic virus (BMV) and cowpea chlorotic mottle virus (CCMV) are not functionally analogous with regard to RNA packaging and seem to employ two distinct packaging mechanisms [5].

Thus, the CP structure of ssRNA viruses can in general be characterized by the presence of two structurally distinct regions: a globular and ordered C-terminal domain involved in the formation of two anti-parallel, four-stranded β-sheets with a jellyroll topology, and an extended, flexible N-terminal domain that is only partially ordered and thus not observable in the electron density [19]. These two structural regions lead to a variety of different interactions during virion assembly, including repulsive CP-CP electrostatic interactions (which inhibit capsid assembly) competing with highly directional, specific CP-CP pairing interactions, and both sequence-specific as well as non-specific electrostatic RNA-CP interactions, which can additionally help to overcome assembly barriers [17, 20]. Both experimental and computational observations show that weak interactions are in general required for productive capsid assembly [1], and that conformational flexibility together with the presence of disordered regions (unusually common in RNA viral proteins) can be related to the ability to interact with multiple and varying partners [21]. A computational study by Perlmutter and Hagan [22] has found that while PSs can confer arbitrarily high specificity of assembly over RNAs with uniform non-specific interactions, the degree of this specificity is overall insensitive to the underlying assembly driving forces, which can be, however, straightforwardly tuned by solution conditions (ionic strength, pH) and charge on the CP tail groups. In addition, the specificity conferred by the PSs can lead to kinetic traps in some regions of parameter space, while in others they indeed oversee a highly specific assembly. Their study found that specificity is maximal under conditions where non-specific interactions alone are slightly too weak to promote effective assembly.

Non-specific, sequence-independent interactions between RNA and the charged N-terminal tails of CPs are thus predominantly electrostatic in nature and stem from clusters of positive charge, often in the form of pronounced arginine-rich motifs (ARMs) [1, 7, 18]. The fundamental importance of ARMs for RNA-CP interaction in several plant virus genera is well established, and the positively charged N-tails can thus be envisioned to stabilize encapsidated RNAs within the virus particle [5, 23]. The existing works on N-terminal tails of virus CPs have used varying definitions of the N-tails and have either reported a direct correlation between the ssRNA genome length (and thus its total negative charge) and the net charge on the peptide arms [14, 15], or have concluded that there is no universal genome-to-tail charge ratio [16]. In this work, we provide two explicit and complementary definitions of the N-terminal tails, based either on the secondary structure of the viral CPs or on the presence of intrinsically disordered regions in them, and use them to classify the N-tails of a large number of ssRNA+ viruses. We compare how the predicted length and charge of the N-tails vary with other parameters, such as genome length and capsid size. In addition, we determine the solvent-accessible surface of the remaining (structured) parts of CPs and obtain the ionizable amino acid residues on them. With this, we are able to study the full pH dependence of the charge on both the N-tails and the CPs, a dependence which has been observed to have a significant impact on the self-assembly of viruses [22, 24-27].

We perform our study on the CPs of various ssRNA+ viruses whose coordinate files are obtained from PDB [28] and VIPERdb [29] databases. In addition, we use the NCBI Nucleotide database [30] to extract the (approximate) genome lengths of the viruses in our dataset. In a few cases, we also use the UniProt database [31] to obtain full primary sequences of CPs that have incomplete entries in PDB/VIPERdb. While we are chiefly interested in unique viral entries, we also include a few examples of several deposited entries of the same viral CP, in order to estimate the errors of our predictions. In total, our analysis includes 116 different PDB/VIPERdb entries corresponding to 80 distinct viruses, which in turn belong to 12 different viral families: Bromoviridae, Caliciviridae, Dicistroviridae, Flaviviridae, Hepeviridae, Iflaviridae, Leviviridae, Nodaviridae, Picornaviridae, Secoviridae, Tombusviridae, and Tymoviridae; included are also several entries belonging to the genus Sobemovirus, yet unassigned to a family, and several entries of satellite viruses. A list of all the viruses in our dataset is given in table S1 in the supplementary material (stacks.iop.org/JPhysCM/30/024001/mmedia).

While CP-CP interactions play a major role in virion assembly, their contribution relative to RNA-CP interactions varies among different viruses [4, 5, 18] . Throughout this paper, we will present separately viruses from families which are known to utilize positively charged tails (Bromoviridae, Nodaviridae, Sobemovirus, and Tombusviridae), and viruses from those families which do not utilize them. In addition, we will present separately the satellite viruses as well as the viruses belonging to Leviviridae, as the CPs of the latter have a unique fold among non-enveloped ssRNA viruses, where, in the absence of N-terminal tails, the β-sheet is responsible for interaction with the viral RNA [4] .

The majority of the viruses in our dataset have a triangulation number of T = 3, meaning that their capsids are composed of 60T = 180 copies of the same protein. As a consequence of experimental methods and capsid structure reconstruction, the database entries of these viruses usually include three copies of the same CP (the asymmetric unit of the capsid [29]), whose structure can be resolved to a different extent. Our dataset also includes numerous viruses which have a pseudo-T = 3 (T = p3) number, their capsids possessing the same symmetry as T = 3 capsids, yet consisting of three different proteins. In addition, the asymmetric units of T = p3 viruses often consist of four and not only three capsid proteins, as is the case in viruses belonging to Picornaviridae: the first three CPs form the outer surface of the capsid, while the inner surface consists of CP4 and the N-terminal part of CP1 [32]. Taking this into account, we will always calculate quantities such as the N-tail charge as charge per tail, averaged over the different chains in a dataset entry. In the case of viruses belonging to Nodaviridae, the CP is cleaved in two during virion maturation, and their dataset entries often possess three copies of each of the resulting two proteins. However, during assembly of the CPs and genome into a provirion, where the N-terminal part of the CP plays an important role [4, 33], the CP is uncleaved [34]. Since the smaller (C-terminal) part of the cleaved protein, deposited in the database as a separate chain, could significantly skew the estimate of charge on the N-tail of the larger part of the protein, we exclude the smaller protein from our dataset. Table S2 in the supplementary material lists the number of chains used for each database entry, along with the triangulation number of each virus.
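The bookkeeping described above (60T protein copies per capsid, and a charge per tail averaged over the chains of one database entry) can be illustrated with a minimal Python sketch. The function names and the three chain charges are hypothetical and serve only as an illustration; they are not taken from the dataset.

```python
from statistics import mean


def copies_in_capsid(t_number: int) -> int:
    """Number of CP copies in an icosahedral capsid with triangulation number T."""
    return 60 * t_number


def mean_charge_per_tail(tail_charges):
    """Average the N-tail charge (in units of e0) over the chains of one entry."""
    return mean(tail_charges)


# Hypothetical tail charges of the three chains in one asymmetric unit.
chain_charges = [9.0, 10.0, 8.0]
per_tail = mean_charge_per_tail(chain_charges)

# Total tail charge of a T = 3 capsid, assuming one N-tail per CP copy.
total_tail_charge = per_tail * copies_in_capsid(3)
```

Averaging over chains is what makes entries with differently resolved copies of the same CP comparable with one another.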

One of the main aims of this work is to provide a clear and consistent definition of a protein N-tail and analyze its consequences. The N-tail groups of CPs are structurally flexible regions, likely to be intrinsically disordered; such regions are often characterized by a high content of polar and charged residues and a low content of bulky hydrophobic residues [4, 19] . They may be ordered in the capsid via interactions with other viral components, but at the same time it is evident that they are flexible in the isolated protein. Taking this into account, we consider two complementary definitions and define the N-tail either (i) as the part of the CP extending from its N-terminus to the first occurrence of a given element of secondary structure; or (ii) as the first intrinsically disordered, contiguous region of the CP, starting again from its N-terminus.

The first definition is the more common one, taking into account the flexibility of the N-tails and contrasting it with the structured part of the CP. On the other hand, the second definition will help us examine the role of disorder in the N-tail regions of viral CPs. In the first case, (i), we use the atomic coordinates of the viral CPs and assign each AA residue a secondary structure using STRIDE [35]. (The results obtained using the assignment given by DSSP [36] match those obtained using STRIDE, and we thus focus only on the latter.) The single-code classification of protein secondary structure given by DSSP/STRIDE involves seven structural elements: $3_{10}$ (G), α (H), and π (I) helices, hydrogen-bonded turns (T), β-sheets (E), isolated β-bridges (B), and bends (S). Not all of these elements, however, necessarily correspond to the structurally ordered part of a protein, and some could be present in the disordered region as well. Thus, in our definition of a protein N-tail, we terminate it at the first occurrence of any of the following structures: G, H, I, and E. We allow for the presence of T, B, and S structural elements in the tail region, as these represent very short stretches of bonding patterns and should thus not inhibit the flexibility of the N-tail. This definition turns out to be the most consistent in comparisons between individual chains of the same CP, and it also provides a good match with the predictions of our second definition of an N-tail.
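Definition (i) amounts to a simple scan over the per-residue secondary-structure codes. The following Python sketch assumes that the assignment is already available as a string of one-letter codes ordered from the N-terminus; the function name and the example string are hypothetical.

```python
# Codes that terminate the N-tail under definition (i): the three helix
# types (G, H, I) and beta-sheet (E). T, B, S and coil are allowed in the tail.
ORDERED_CODES = frozenset("GHIE")


def n_tail_length_ss(ss: str) -> int:
    """Return the number of residues before the first G, H, I, or E code.

    `ss` is a string of one-letter secondary-structure codes, one per
    residue, ordered from the N-terminus. If no terminating element is
    found, the whole chain is returned as tail.
    """
    for i, code in enumerate(ss):
        if code in ORDERED_CODES:
            return i
    return len(ss)


# Hypothetical assignment: ten flexible residues, then a beta-strand.
example_ss = "CCCTTCCSSCEEEEECCHHHH"
tail_len = n_tail_length_ss(example_ss)
```

Turns (T) and bends (S) inside the first ten residues do not end the tail in this example; only the first E does.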

Nevertheless, the above choice of the structural elements signaling the start of a structurally-ordered part of a CP is clearly not the only possible one. What is more, common secondary structure assignment methods can underpredict certain structural elements, such as π-helices [37] . For this reason, we use a second, (ii), independent definition of N-tails and compare it to the first one. We base this definition on the predicted intrinsic disorder in the viral CPs, so that the flexible N-tails should, in general, correspond to a disordered N-terminal stretch of the CP. To predict the intrinsically disordered regions, we use the Metadisorder server (MD2) [38] , one of the best predictors of protein disorder, which combines a number of different disorder predictors into a more accurate meta-prediction method. Using MD2, we thus obtain a prediction for the intrinsically disordered parts in the viral CPs based on their AA sequences. For simplicity, we consider only the first contiguous disordered region of the CP starting at its N-terminus to be the N-tail of the protein, which should be a valid assumption in most of the cases.
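Definition (ii) can be sketched in the same way, assuming that the disorder prediction has already been reduced to one flag per residue, here "D" for disordered and "o" for ordered. How the predictor output is converted to such flags is not covered here, and the function name and example string are hypothetical.

```python
def n_tail_length_disorder(disorder: str, disordered_flag: str = "D") -> int:
    """Length of the first contiguous disordered region starting at the N-terminus.

    `disorder` holds one flag per residue, ordered from the N-terminus.
    The tail ends at the first residue that is not flagged as disordered;
    later disordered stretches are ignored.
    """
    length = 0
    for flag in disorder:
        if flag != disordered_flag:
            break
        length += 1
    return length


# Hypothetical prediction: five disordered residues, an ordered stretch,
# and a second disordered stretch that does not count towards the tail.
example_disorder = "DDDDDoooDDoo"
tail_len = n_tail_length_disorder(example_disorder)
```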

Due to the flexible, disordered nature of the N-tails, they often remain unresolved in structural experiments, and are usually incomplete in the data deposited in the PDB/VIPERdb databases. To remedy this, we compare the structurally resolved part of a viral CP with its full primary sequence in order to obtain any missing residues. Such residues are then assigned a lack of secondary structure (C or '-' in DSSP notation) as well as full disorder (D) for use in the two different N-tail definitions, respectively.
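A minimal Python sketch of this padding step is given below. It makes the simplifying assumption that the resolved residues form one contiguous stretch of the full sequence, so that only terminal residues can be missing; internal gaps would need a proper alignment. The function name and the toy sequences are hypothetical.

```python
def pad_missing_residues(full_seq, resolved_seq, resolved_ss, resolved_dis):
    """Extend per-residue annotations of the resolved part to the full sequence.

    Residues absent from the structure are assigned coil ('C') for the
    secondary-structure string and full disorder ('D') for the disorder
    string. Assumes the resolved part is a contiguous substring of the
    full sequence (a simplification of this sketch).
    """
    start = full_seq.find(resolved_seq)
    if start < 0:
        raise ValueError("resolved sequence is not a contiguous part of the full sequence")
    n_missing = start                                      # missing at the N-terminus
    c_missing = len(full_seq) - start - len(resolved_seq)  # missing at the C-terminus
    ss = "C" * n_missing + resolved_ss + "C" * c_missing
    dis = "D" * n_missing + resolved_dis + "D" * c_missing
    return ss, dis


# Toy example: the first four residues are missing from the structure.
ss_full, dis_full = pad_missing_residues(
    full_seq="MSTVGTGKLT",
    resolved_seq="GTGKLT",
    resolved_ss="CCEEEE",
    resolved_dis="DDoooo",
)
```

After padding, both annotation strings have the length of the full primary sequence, so either tail definition can be applied directly from the true N-terminus.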

Both N-tail definitions, (i) and (ii), yield in the end an N-terminal sequence of AA residues belonging to the flexible, disordered region of a viral CP. The remaining AA sequence and its assigned secondary structure we then attribute to the structurally-ordered body of the CP. In the rest of the paper, we will refer to the latter region simply as CP, and we will explicitly specify when we will be referring to the entire coat protein including its N-tail.

To obtain the charges on the CPs and the N-tails at any given pH, we follow the procedure fully elaborated previously in [39]. The AA residues we consider as ionizable are aspartic acid, glutamic acid, tyrosine, arginine, lysine, and histidine. We include the charge on the N- and C-terminus, but we do not consider the acidity of cysteine, a very weak acid which can form disulfide bonds, the exclusion of which should have no qualitative influence on our study [39, 40]. The charge on the ionizable residues at a given pH is given by the Henderson-Hasselbalch equation, which yields the fractional charge of a residue $k$ given its static dissociation constant $\mathrm{p}K_a^{(k)}$:

$$q_k^{\pm} = \frac{\pm 1}{1 + e^{\pm \ln 10 \,\left(\mathrm{pH} - \mathrm{p}K_a^{(k)}\right)}} \qquad (1)$$

for bases ($q_k^{+} > 0$) and acids ($q_k^{-} < 0$), respectively. For the $\mathrm{p}K_a$ values of the different ionizable residues we use the canonical values for isolated AAs [40]. Equation (1) furthermore assumes the limit of relatively high physiological salt concentration, where the electrostatic potential does not induce a significant local shift in $\mathrm{p}K_a$ and can thus be ignored [39]. In addition, we treat as ionizable only those AA residues which are solvent-accessible. We use STRIDE to determine the relative solvent accessibility (RSA) of each residue, with the cutoff of c = 0.2 defining the residue accessibility (i.e. RSA ≥ 0.2 defining the solvent-accessible residues). As mentioned before, certain parts of the CPs are structurally unresolved and absent in the data; these residues belong mostly to the flexible N-terminal parts of the proteins. For this reason, we treat any residues missing in the structural data as being completely accessible, assigning to them an RSA = 1.
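The charge calculation can be sketched in Python as follows. The pKa values in the table below are assumed, textbook-style numbers chosen for illustration only; they are not necessarily the canonical values of [40]. The function names and the optional terminus handling are likewise hypothetical.

```python
# Illustrative (assumed) pKa values and charge signs of the ionizable
# residues; +1 marks bases, -1 marks acids.
PKA = {
    "D": (3.9, -1), "E": (4.3, -1), "Y": (10.1, -1),
    "R": (12.5, +1), "K": (10.5, +1), "H": (6.0, +1),
}
PKA_N_TERM = (8.0, +1)   # assumed value for the N-terminus
PKA_C_TERM = (3.6, -1)   # assumed value for the C-terminus


def fractional_charge(pka: float, sign: int, ph: float) -> float:
    """Henderson-Hasselbalch fractional charge of a base (+1) or an acid (-1)."""
    return sign / (1.0 + 10.0 ** (sign * (ph - pka)))


def net_charge(seq, ph, rsa=None, cutoff=0.2, n_term=False, c_term=False):
    """Net charge (units of e0) of a sequence at a given pH.

    `rsa` is an optional list of relative solvent accessibilities, one per
    residue; None entries (or rsa=None) stand for residues missing from the
    structure, which are treated as fully accessible (RSA = 1).
    """
    total = 0.0
    for i, aa in enumerate(seq):
        if aa not in PKA:
            continue
        accessibility = 1.0 if rsa is None or rsa[i] is None else rsa[i]
        if accessibility < cutoff:
            continue  # buried residues are not treated as ionizable
        pka, sign = PKA[aa]
        total += fractional_charge(pka, sign, ph)
    if n_term:
        total += fractional_charge(*PKA_N_TERM, ph)
    if c_term:
        total += fractional_charge(*PKA_C_TERM, ph)
    return total


# Toy arginine-rich tail at neutral pH: close to +4 e0.
q_tail = net_charge("ARRARR", ph=7.0)
```

Writing the exponential in base 10 is equivalent to the form with $e^{\pm \ln 10\,(\ldots)}$; at pH equal to the pKa, a residue carries exactly half of its full charge.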

The interplay of charge on the N-tails (especially when they are enriched for positive charge) and charge on the CPs can be of significant importance for capsid assembly. In particular, certain ssRNA+ viruses tend to preferentially utilize CP-RNA interactions in their assembly, while the capsids of others are stabilized by CP-CP interactions. In figure 2, we show the pH dependence of charge on the N-tails and CPs of two different viruses, CCMV and physalis mottle virus (PhMV), belonging to the Bromoviridae and Tymoviridae families, respectively. In Bromoviridae, positive clusters of charge on the N-tails are known to play an important role in the assembly of functional virions, and the first 26 residues of CCMV carry a significant positive charge interacting strongly with the negatively charged RNA [13, 18]. Members of Tymoviridae are, on the other hand, predominantly stabilized by CP-CP interactions [5].

The CPs of the two viruses in figure 2 show a similar pH dependence of their charge, which goes from positive to negative as pH is increased from acidic to basic, exhibiting a plateau around neutral pH values, similar to the dependence previously observed for the charge on full capsids of Leviviridae phages [40]. The CP of CCMV has an acidic isoelectric point (point of vanishing charge, where CPs can be crudely treated as electric dipoles [17, 39]), whereas the CP of PhMV has a basic one. The charge on the N-tails and its pH dependence are, on the other hand, quite different for the two viruses. The N-tails of CCMV have a large positive charge, mostly stemming from basic AAs and in particular from pronounced ARMs [5, 23], and the large positive charge persists far into the range of basic pH values. Charge on the N-tails of PhMV is comparatively much lower, and becomes negative early on in the pH range. This is true regardless of the definition of the N-tails we use, be it by virtue of secondary structure assignment or protein disorder prediction.
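Since the Henderson-Hasselbalch charge decreases monotonically with pH, the isoelectric point can be located by bisection. The sketch below is self-contained and again uses assumed, illustrative pKa values; the handling of sequences whose charge never changes sign on the interval 0-14 (the pI is reported at the corresponding end of the range) is a convention of this sketch.

```python
# Assumed, illustrative pKa values; +1 marks bases, -1 marks acids.
PKA = {
    "D": (3.9, -1), "E": (4.3, -1), "Y": (10.1, -1),
    "R": (12.5, +1), "K": (10.5, +1), "H": (6.0, +1),
}


def net_charge(seq: str, ph: float) -> float:
    """Sum of Henderson-Hasselbalch fractional charges of the ionizable residues."""
    total = 0.0
    for aa in seq:
        if aa in PKA:
            pka, sign = PKA[aa]
            total += sign / (1.0 + 10.0 ** (sign * (ph - pka)))
    return total


def isoelectric_point(seq: str, lo: float = 0.0, hi: float = 14.0, tol: float = 1e-6) -> float:
    """Find the pH of vanishing net charge by bisection on [lo, hi]."""
    if net_charge(seq, hi) >= 0.0:
        return hi  # charge never becomes negative in the range
    if net_charge(seq, lo) <= 0.0:
        return lo  # charge is never positive in the range
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if net_charge(seq, mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# One acid (pKa 3.9) and one base (pKa 10.5): the charges cancel midway.
pi_dk = isoelectric_point("DK")
```

For a single acid and a single base the two fractional charges cancel exactly halfway between the two pKa values, which gives a convenient check of the bisection.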

Before we analyze any further the pH dependence of N-tail and CP charges and their relations to other properties of viruses, we would like to compare in more detail the predictions of the two different definitions of N-tails proposed in the Methods section. These define the tails based either on the assigned protein secondary structure (STRIDE) or on the prediction of intrinsically disordered regions in it (MD2). Figure 3 shows the differences between the two methods in the predicted tail lengths and the average charge per tail, evaluated at three different pH values, for the entire dataset of analyzed viruses; individual plots of the differences in the predictions for two viruses, BMV and PhMV, are shown in figure S1 in the supplementary material. The example of BMV can also be seen in the sketch of figure 1, where the tail determined by the first definition is shown in the 3D structure of the CP, showing that it indeed ends at the bulk, structured part of the CP. In addition, the partial AA sequence of the protein already indicates that the major part of positive charge on the tail stems from arginine residues. The tail determined by the second definition is, on the other hand, shorter by 25 AA, but nonetheless captures the majority of the positively charged clusters on the tail (figure S1).

From figure 3 we can see that, for most viruses in our dataset, the differences between the two N-tail definitions in the predicted length of the N-tails are below 20 AA. In general, tails defined according to the secondary structure assignment tend to be longer compared to the tails from the disorder-based definition. While these differences in tail lengths might still be large enough to change the number of charges on them, it turns out that the differences in the predicted charge on the tails are usually smaller than ±2 $e_0$, and do not seem to depend on the differences in the predicted tail lengths.

There are, however, three notable exceptions to these observations, all of them belonging to Tombusviridae: tomato bushy stunt virus (PDB: 2TBV), melon necrotic spot virus (PDB: 2ZAH), and cucumber necrosis virus (PDB: 4LLF). In all three cases, predictions based on protein disorder result in significantly shorter tails (a difference of more than 80 AA), which in turn leads to an underprediction of the charge these tails carry by almost 10 $e_0$.

Interestingly, the other Tombusviridae entries in our dataset do not show these differences.

In the rest of the paper, we will use the results obtained using the first definition of N-tails (based on secondary structure assignment), keeping in mind that the results obtained using the second definition (based on protein disorder prediction) mostly match those of the first one, yet remembering that there are at the same time a few notable exceptions. For completeness, all the results obtained using the first definition and analyzed in the paper are also shown using the second definition in figures S2-S7 in the supplementary material.

Figure 1 (caption). Ionizable AA residues are shown in red and blue (positively and negatively charged, respectively), while the predicted split of the protein into the N-tail and the structured part of the CP, based on the assigned secondary structure, is shown in gray and beige, respectively. Inset shows the primary AA sequence, assigned secondary structure, and the prediction of intrinsic disorder in a region of the protein. Secondary structure is assigned to the AA residues in the CP using STRIDE, and we define the end of the N-tail as the first occurrence of a given structural element, in this case a β-sheet (E). Alternatively, we split the protein into the N-tail and CP based on the predicted disordered regions (D) in it, where the tail ends at the first occurrence of an ordered region (o); in this case, this results in a shorter predicted tail. Afterwards, we assign the ionizable residues in the protein a fractional charge, allowing us to obtain the charge on the tail and the CP at any pH. Residues which are predicted to be buried (not exposed to the solvent) are not considered as ionizable, and are highlighted in a lighter color in the primary sequence.

In order to compare the pH dependence of the charge on the N-tails and CPs of all viruses in our dataset, we show in [...] the CPs shifts overall from positive to negative when the pH increases from acidic to basic, the charge on the N-tails of viruses in these families remains positive and decreases only very slightly. It takes very basic values of pH to eliminate the positive charge these tails carry. This is signified also by the highly basic pIs of the N-tails in these viruses where positive charge on the N-tails plays a significant role (pI ≳ 11 using the first definition of a tail; figure 5). In addition, this illustrates the fact that these tails consist predominantly of clusters of positive charge only (often as a part of the ARMs [1, 5, 7]), with negatively charged residues few and far between.
On the contrary, pIs of the tails of other viruses are in general more acidic and span a larger range of values. We also note that pIs of viruses where the N-tails are very short or carry almost no charge fall on the line pI tail = 14 as a consequence of the flat pH dependence 

Lastly, we wish to examine the relationship between the predicted length and charge on the N-tails and some other characteristics of the viruses in our dataset. Specifically, we will be interested in the variable lengths of the viral RNA genomes, which thus bear a variable net negative charge, and in the average capsid sizes, which in icosahedral ssRNA+ viruses are often tightly related to a characteristic compactness of their genomes [11, 12]. Figure 6 compares the average charge on the viral N-tails and the characteristic lengths of their wild-type (WT) genomes. Notably, we observe that genome lengths of viruses which utilize positively charged N-tails appear to reach only a limited value (∼6 knt). Other viruses, which do not possess positively charged N-tails, tend to have longer genomes, all the way up to 10 knt. The only exception of a virus with positively charged tails and a comparable genome length is the recently discovered Orsay virus (PDB: 4NWV), which infects nematodes and resembles viruses in the Nodaviridae family, but has not yet been classified [41, 42]. The exceptions in the opposite sense are the phages belonging to the Leviviridae family, which do not possess any tails and yet pack genomes of only 3-4 knt in length. Some of these phages, such as MS2, are known to utilize RNA packaging signals to direct their capsid assembly [43]. We note that viruses from the Togaviridae family, absent in our dataset, are also known to possess N-tails containing clusters of positive charge [4, 14, 18]. While these viruses tend to pack longer genomes (∼10 knt), they also have significantly larger capsids with T = 4 symmetry, unlike any virus in our dataset.

In addition, we enlarge in figure 6 the region where the viruses utilizing positively charged tails are located. While some of the viruses here show a distinct correlation between the charge on the viral N-tails and the characteristic lengths of their WT genomes, others appear without any correlation. This is in stark contrast to the claims of universality of the genome-to-tail charge ratio based on theoretical modeling (see discussion for details).

In the viruses in our dataset, the more striking relation is thus the one between the genome lengths of viruses which utilize positively charged N-tails and the genome lengths of those which do not, as the latter tend to pack much larger genomes than the former. This large difference is, interestingly, not related to the average size of the capsids into which these genomes are packaged (figure 7). The vast majority of the capsids have similar average size, even though their genome sizes vary significantly (as does their CP composition [5]). The observed difference in the genome sizes of the two broad classes of viruses is also not a consequence of the viruses without any charge on the tails actually having no tails at all: the tails of the viruses in our dataset range anywhere from 0 to 100 AA, and the lengths of non-charged tails can still approach 50 AA (figure S8 in the supplementary material). Similarly, the predicted lengths of the tails also do not correlate with the genome lengths (figure S9 in the supplementary material). Of course, the absence of net charge on the N-tails of viruses with longer genomes does not mean they cannot bind the genome. In these viruses, a more complex electrostatic mechanism could be at work, involving, for instance, polyampholyte-polyelectrolyte complexation [44-46] or multipolar interactions [39, 47].

For many viruses, the non-specific electrostatic interactions between the positively charged, highly flexible N-terminal arms and the RNA genome play a fundamental role in their assembly and structural integrity. However, since these extended CP tail groups lack any definite structure and are thus intrinsically disordered, their definition cannot be entirely devoid of ambiguity. In an attempt to minimize this underlying vagueness, we investigated in detail the length and the charge state of the N-tails of 80 distinct viruses (116 database entries in total). We did this by introducing two concise but different definitions of a capsid protein N-tail, based either on the first occurrence of a given secondary structure element, or on the detection of an intrinsically disordered, contiguous part of the protein at the N-terminus. In this choice we have attempted to generalize the previous work of Hu et al [15], based on a dataset of 27 viruses, where the N-tail was defined as the flexible sequence of AAs starting from the protein N-terminus and ending at the first α-helix (H) or β-sheet (E). Their work moreover established that the disordered (free) part of the N-tail, obtained from a comparison between the tail sequence and the missing part in the experimental data, comprises on average 76% of the full tail length.

The investigation of the length and charge state of the N-tails is most immediately relevant for elucidating a possible universal value of the genome-to-tail charge ratio in viruses with positively charged tails. The seminal work of Belyi and Muthukumar [14], focused on a subclass of 15 different WT and 5 mutant ssRNA viruses that bind their genome by using long and highly basic peptide arms, found a linear scaling between the genome length and the net charge on the capsid peptide arms. This scaling appeared to be robust, with very low uncertainty. The error of the charge on the peptide arms was estimated at just ±1 residue and mainly attributed to sequence variations between virus species and the uncertainty in distinguishing flexible peptide arms from the bulk of the capsid protein. However, no detailed definition of the tail was provided, and it is unclear whether and how different definitions would modify the main results. A similar conclusion regarding the universality of the genome-to-tail charge ratio was also reached by Hu et al [15], who report a scattered charge inversion ratio with a median value of 1.8 for a subset of 13 of the 27 viruses described above. Contrasting the claims of universality, a study by Ting et al [16], which used a thermodynamic framework to determine the optimal genome length in electrostatically driven viral encapsidation, led to the opposite conclusion. Namely, they found no universal genome-to-capsid charge ratio, and found that a fitted linear relationship between the genome and capsid charge is quite sensitive to the choice of viruses included in the dataset. Nevertheless, all the viruses from [14] and [15] were found to be overcharged with respect to the packaged genome, a situation which the authors attribute to the 'Donnan potential' and not specifically to the electrostatic attraction between the RNA and the capsomeres [6]. In addition, in all these works the RNA was treated as a linear polyelectrolyte, ignoring its secondary structures, which were very recently shown to affect the virus electrostatics profoundly and fundamentally [9, 10, 47].
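To make the quantity under discussion concrete, the genome-to-tail charge ratio can be sketched as follows. The sketch assumes one negative elementary charge per nucleotide (one phosphate group each) and a capsid built from 60T identical CPs, each contributing the same net tail charge; the function name and the input numbers are hypothetical and chosen only for illustration, not taken from any virus in the datasets cited here.

```python
def genome_to_tail_charge_ratio(genome_length_nt, tail_charge_per_cp, t_number):
    """Ratio of the magnitude of the genome charge to the total N-tail charge.

    Assumptions (for illustration only):
      - each nucleotide carries one negative elementary charge,
      - the capsid contains 60 * T identical coat proteins,
      - every coat protein carries the same net tail charge.
    """
    n_cp = 60 * t_number
    total_tail_charge = tail_charge_per_cp * n_cp
    if total_tail_charge <= 0:
        raise ValueError("ratio is only defined for net positively charged tails")
    return genome_length_nt / total_tail_charge


# Hypothetical example: a 3240 nt genome in a T = 3 capsid (180 CPs)
# whose tails each carry a net charge of +10.
ratio = genome_to_tail_charge_ratio(3240, 10.0, 3)
print(f"genome-to-tail charge ratio: {ratio:.2f}")
```

A ratio above 1 corresponds to an overcharged virion, in which the genome carries more negative charge than the tails carry positive charge. Because the tail charge enters the denominator directly, an uncertainty of even a few charges per tail, arising from the definition of the tail, is multiplied by the number of CPs in the capsid.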

Regarding the genome-to-tail charge ratio, the results presented in this work, based on improved and more detailed definitions of the N-tails and their charge as well as on a larger dataset of viruses, are in line with the conclusions of [16] and do not point towards a universal scaling of the ratio in viruses with positively charged tails (figure 6). They do, however, show that the lengths of the WT genomes of these viruses do not seem to exceed ∼6 knt, unlike in other viruses with similar capsid size, although some caution should be exercised since this could easily be affected by the limited number of viruses in our dataset. In addition, it is known that viruses with positively charged tails can also pack non-native RNAs and other polymers of different lengths [48, 49], as is the case with, e.g., CCMV. CCMV preferentially packages RNA1 of BMV instead of its own native RNA1 [50], and is found capable of packaging polyU RNAs, which do not form secondary structures and act essentially as structureless linear polymers [8]. The polyU RNAs are packaged more efficiently than WT RNAs of equal length, and while polyU RNAs up to 5 knt are completely packaged, the resulting virions exhibit smaller-than-WT capsids. Longer polyU RNAs up to 9 knt are also packaged, but into multiplet capsids. The conclusion is therefore that RNA secondary structure (or its absence) plays an essential role in determining the capsid structure during self-assembly of CCMV-like particles, as was later rationalized in detail from RNA polyelectrolyte models taking into account the secondary structure of non-linear RNAs [9, 10]. Inherently branched RNA secondary structure appears to allow viruses to maximize the amount of encapsidated genome, and makes the assembly more efficient and the virion more stable. Nevertheless, at present there is no good rationalization for why so many viruses show absolutely no correlation between the genome length and the charge on the N-tails. One could in principle invoke PSs or, in more general terms, the sequence of the viral RNA, which could somehow decouple the genomic charge and the structural capsid charge, leading to the features observed in figure 6.

Apart from the theory-driven analytical polyelectrolyte models of RNA packaging, simulations of viral assembly show that for a given type of CP and solution conditions there exists an optimal length of RNA for the assembly [1]. This is confirmed by analytical calculations in numerous works that have studied the various influences on the optimal length of the packaged RNA [6, 9, 10, [14] [15] [16]. While the analysis of generic electrostatics for a linear RNA in a charged shell is straightforward, the details of the non-uniform charge distribution of the shell and indeed of the CP-tail effect, coupled to the secondary structure of the RNA genome, are more difficult to evaluate quantitatively, but calculations including them appear to underpredict the optimal genome length and thus the overall overcharging of the virion [47]. In addition, calculations based on linear polyelectrolytes, rather than base-paired nucleic acids, also underpredict the optimal length of the packaged genome, further demonstrating the importance of the nucleic acid structure for the assembly [23]. There is therefore growing evidence that the sequence of the viral RNA plays an important role in packaging, most probably through the coupling between the secondary structure of the RNA and its modifications when subjected to confinement in the virus shell with a complicated non-uniform charge distribution. The relative importance of RNA-RNA contacts compared to RNA-CP contacts and the detailed roles of specific and non-specific (electrostatic) interactions have yet to be determined conclusively.

The results presented in our study, and specifically the lack of an observed universal genome-to-tail charge ratio, are in part constrained by the limited dataset of viruses, even though it is several times larger than those used in previous works. In addition, the results can be influenced to an extent by different definitions of N-tails, as well as the (necessary) simplifications assumed in the calculation of charge on the AA residues of capsid proteins (see [39] for a detailed discussion of the latter). Lastly, we wish to mention a different possible source of discrepancies related to the choice of the virus dataset, and that is the structural data itself. Different database entries of CPs of the same virus can lack information on both the sequence and the structural level, to various degrees. In addition, the experimental conditions used in determining the structure of the CP can vary, including the pH of the surrounding solution, temperature, strain of the virus, and not least the presence or absence of genomic material in the capsid itself. It is known, for example, that the amount of structural disorder found in capsid proteins varies strikingly not only between but also within viral families [19].

Variations in the resolved structure of a CP can in turn influence our ability to accurately assess the length and charge of the N-tails. For this reason, we also included in our dataset several different database entries of the same viral CP, where available, and investigated the resulting variations in the properties of their N-tails. Figure S10 in the supplementary material shows the examples of two viruses, human rhinovirus and PhMV, with four different entries each. In the case of the human rhinovirus, we include three different strains, which already differ slightly in the length of the full capsid protein. The PhMV capsid proteins, on the other hand, have the same size in all four cases, yet they, too, possess tails of slightly different lengths. As a consequence, we can observe in both examples some variation in the lengths of the determined tails and in the electrostatic properties of the tails and the CPs. This variation remains in much the same range as the variation between different definitions of N-tails, and this is also the case for the other examples we examined. Some entries can again, however, yield quite different results: the pI of human rhinovirus 2, for instance, is acidic, while those of human rhinoviruses 1A and 16 are basic. Such differences are limited mostly to viruses coming from different strains. Importantly, the results for duplicate entries of viruses with positively charged tails that we have examined do not differ much (see tables S3 and S4 in the supplementary material).

Our work presents two clear and complementary definitions of the flexible, disordered N-tails of viral coat proteins: the first based on the assignment of secondary structure to the proteins, the other on the predicted intrinsically disordered regions in them. We have shown that the predictions of the two definitions are comparable and, for the most part, consistent. Predicted lengths of the N-tails usually agree within 20 AA, and the consequent predictions of the average charge on the N-tails are within ±2 e₀. These differences are of the same order as the uncertainties stemming from the method used to determine the ionizable AAs and their charge [39]. And while using different database entries of the same viral coat protein usually yields comparable results, the number of different deposited capsid structures of ssRNA+ viruses remains small enough that the choice and size of the dataset still have the potential to influence quantitative predictions.

Another important improvement on previous works on N-tails that we have included in our study is taking into account the pH-dependent charge on all ionizable AAs, which allows us to determine more accurately the charge on the N-tails and CPs at any value of pH. This dependence shows how decisive the positively charged N-tails of some viruses can be, as they often contribute as much charge to the protein as the rest of the (structurally ordered) CP itself, while pushing the pI of the capsid proteins to the basic range of pH. Our observations are in line with other studies showing that CP-CP interactions get weaker with increased pH, while CP-RNA interactions remain strong by virtue of positively charged N-tails [26, 51, 52]. Here, we have shown that this is a robust consequence of the fact that the pH variation of the N-tail charge is much less pronounced than the corresponding variation for the structural part of the CPs. The understanding of the pH-dependence of charge in different viruses should have important consequences for the relative interplay of CP-CP and RNA-CP interactions in them, which can be directly related to their assembly mechanisms as well as stability.
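The weak pH-dependence of the N-tail charge can be illustrated with a simple Henderson-Hasselbalch estimate of the charge on the ionizable AAs. The sketch below is not the calculation used in this work (see [39] for that): the pKa values are generic textbook-style values for isolated residues, the chain termini and any shifts of pKa due to the local environment are ignored, and both toy sequences are hypothetical.

```python
# Generic pKa values for isolated side chains (assumed, illustrative only).
PKA_ACIDIC = {"D": 3.9, "E": 4.3, "C": 8.3, "Y": 10.1}   # charge 0 -> -1
PKA_BASIC = {"H": 6.0, "K": 10.5, "R": 12.5}             # charge +1 -> 0


def net_charge(sequence, ph):
    """Net charge (in units of the elementary charge) of the ionizable
    side chains in `sequence` at a given pH, with each residue treated
    as an independent acid or base (Henderson-Hasselbalch)."""
    charge = 0.0
    for aa in sequence:
        if aa in PKA_BASIC:
            charge += 1.0 / (1.0 + 10.0 ** (ph - PKA_BASIC[aa]))
        elif aa in PKA_ACIDIC:
            charge -= 1.0 / (1.0 + 10.0 ** (PKA_ACIDIC[aa] - ph))
    return charge


# Hypothetical arginine- and lysine-rich tail and a hypothetical ordered segment.
toy_tail = "MKRRSTARRK"
toy_core = "DEEDHKLA"

for ph in (5.0, 7.0, 9.0, 10.0):
    print(f"pH {ph:4.1f}: tail {net_charge(toy_tail, ph):+.2f}, "
          f"core {net_charge(toy_core, ph):+.2f}")
```

Within these assumptions, the tail retains most of its positive charge up to pH 10, because the arginine side chains deprotonate only at still higher pH, whereas the charge of the toy ordered segment is already negative at neutral pH.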

Comparing the charge on the predicted N-tails of viruses in our dataset and the lengths of the corresponding WT genomes packaged in them, we have observed that viruses which utilize positively charged tails and RNA-CP interactions in their assembly pack smaller genomes than viruses where CP-CP interactions are dominant. This observation is not related to the average size of the capsids these genomes are packaged into. Our data also indicate that there is no 'universal' genome-to-tail charge ratio in viruses with positively charged tails. The mechanisms behind these observations remain unclear at the moment, but recent discoveries point toward the importance of RNA secondary structure for its packaging and interaction with the capsid proteins.

ALB and RP acknowledge the financial support from the Slovenian Research Agency (research core funding No. P1-0055).

Anže Lošdorfer Božič https://orcid.org/0000-0001-6304-6637