Menichetti, Roberto; Giulini, Marco; Potestio, Raffaello: A journey through mapping space: characterising the statistical and metric properties of reduced representations of macromolecules. Eur Phys J B, 2021-10-12. DOI: 10.1140/epjb/s10051-021-00205-9. ABSTRACT: A mapping of a macromolecule is a prescription to construct a simplified representation of the system in which only a subset of its constituent atoms is retained. As the specific choice of the mapping affects the analysis of all-atom simulations as well as the construction of coarse-grained models, the characterisation of the mapping space has recently attracted increasing attention. We here introduce a notion of scalar product and distance between reduced representations, which allows the study of the metric and topological properties of their space in a quantitative manner. Making use of a Wang–Landau enhanced sampling algorithm, we exhaustively explore such space, and examine the qualitative features of mappings in terms of their squared norm. A one-to-one correspondence with an interacting lattice gas on a finite volume leads to the emergence of discontinuous phase transitions in mapping space, which mark the boundaries between qualitatively different reduced representations of the same molecule. The research area of computational molecular biophysics has experienced, in the past few decades, impressive advancements in two complementary and strictly intertwined fields: on the one hand, the steadily growing and increasingly cheaper computational power has enabled the simulation of ever larger systems with atomistic resolution [1, 2]; on the other hand, there has been an explosion of diverse coarse-grained (CG) models [3–5], i.e.
simpler representations of molecules in terms of relatively few sites interacting through effective potentials: these have filled several gaps between the length- and time-scales of interest and the current capability of all-atom methods to cover them. The scientific efforts making use of one or both of these techniques have cracked several important problems open, ranging from protein folding to cell growth [6–8]. The development of a successful CG model is strongly dependent on the choice of the reduced representation, or CG mapping, and on the correct parametrization of the effective interactions [4, 8]. The latter challenge has received an enormous amount of attention, leading, e.g. in the case of proteins, to extremely accurate and sophisticated CG potentials such as OPEP [9, 10], AWSEM [11, 12] and UNRES [13, 14]. (Corresponding author e-mail: raffaello.potestio@unitn.it.) The former task has been the object of a smaller number of works; however, its centrality in and beyond the process of coarse-graining has recently started to emerge [8, 15]. Indeed, a deep relationship exists between the degrees of freedom one selects to construct a CG model of the system and those one employs to analyse the system's behaviour from a more detailed representation. On the one hand, high-resolution, fully atomistic models are necessary to let the properties and behaviour of complex biomolecular systems emerge; on the other hand, the interpretation and understanding of this behaviour requires a reduction of the mountain of data and its synthesis into a smaller amount of information. In a nutshell, while the generative process has to be high-resolution to be useful, its outcome has to be low-resolution to be intelligible. An intuitive example of this concept is given by the representation of a protein structure in terms of its Cα's, i.e.
the alpha carbons of the backbone: this mapping is not only extensively employed in the development of CG models [16, 17] (that is, models in which the whole amino acid is represented as a single bead whose position coincides with that of the Cα), but it is also extremely common in the analysis of structures sampled in fully atomistic simulations [18, 19]. A few different strategies have been developed that aim at identifying the optimal CG mapping to describe a molecule, and which differ most notably in the observable used to drive the optimisation. There exists a first class of algorithms that rely on a completely static, graph-based description of the system [20, 21], such as that of Ref. [20]. A second group of approaches makes use of the dynamics of the system, obtained through models with more [22, 23] or less [24, 25] detailed force fields. For instance, a recent protocol proposed by us [23] revolves around the analysis of an all-atom molecular dynamics (MD) [26, 27] simulation trajectory of a protein in terms of a subset of the molecule's atoms; a physics-driven choice of the latter allows one to identify the one or few mappings that return the most parsimonious yet informative simplified description of the system. Each of these methods can be the most appropriate to investigate specific properties of the system at hand; at the same time, the majority of them performs the search for solutions of an optimisation problem within the overwhelmingly large space of all possible CG representations that can be assigned to the system. As an exhaustive exploration of this mapping space is hardly feasible in practice, the outcome of such schemes is often a pool of optimal CG representations, that is, an ensemble of local minima in the (likely rugged) landscape of the cost function that defines the mapping optimisation procedure. Given the complexity of this problem, several natural questions arise: how degenerate is the space of solutions?
Are its elements all markedly distinct from one another, or do they only represent mostly neutral and equivalent variations of a "typical" optimal CG mapping? Furthermore, how are the solutions distributed across the space of possible mappings? How much do they differ from, e.g., randomly chosen CG representations? To be in a position to answer these questions, the most basic instrument one needs is a meter, i.e. a tool to measure distances in mapping space and assess the degree of "similarity" among its elements in a quantitative manner. The definition of such a metric should be independent of the choice of the function subsequently employed to quantify the quality of a given reduced representation, just as, e.g., the Euclidean distance separating two particles is independent of the interaction potential acting between them. The objective of the present work is thus to contribute a tool for the exploration and characterisation of the mapping space, its metric and topological properties, and the relations among its instances. To this end, we introduce a notion of scalar product, and consequently of norm and distance, between reduced representations of a system. We first make use of these instruments in the exploration of some basic metric and topological properties of mappings of a single, static molecular structure, i.e. without reference to its interactions and/or conformational sampling, but rather solely considering the coordinates of its constituent atoms. This provides a notion of mapping distance based on purely geometric properties of the molecule. Through the application of an enhanced sampling algorithm, namely the Wang–Landau method [28, 29], we characterise this mapping space in its entirety, and associate its properties to structural features of the underlying molecule.
Furthermore, the isomorphism between the problem of exploring the possible mappings of a molecule and that of a lattice gas in a finite volume makes it possible to highlight the emergence of first-order phase transitions in the latter, distinguishing CG representations with qualitatively different properties. We then investigate the topology of the mapping space making use of the distance between reduced representations, which enables a low-dimensional embedding that highlights its general features. This analysis is performed both in the absence and in the presence of a cost function, namely the mapping entropy [23, 24, 30–32], which gauges the quality of a given CG representation according to an information-theoretical measure. Finally, we suggest a possible manner to extend the tools we introduced to characterise the mapping space in the static case so as to incorporate information about the reference system's exploration of conformational space and thus, indirectly, about its energetics as well. The paper is organised as follows: in Sect. 2 we develop a scalar product between decimation mappings of a macromolecular structure in a static conformation, and derive from it a notion of norm and distance in the mapping space; in Sect. 3 we study CG representations in terms of the distribution of values of the squared norm for mappings having a given number of retained sites N, first through random sampling, then making use of the Wang–Landau enhanced sampling method; in Sect. 4 we exploit a duality between the problem of mappings of a macromolecule and that of an interacting lattice gas in a finite volume to investigate the properties of the molecule's reduced representations; in Sect. 5 we discuss some topological features of the mapping space; in Sect. 6 we discuss an extension of the structure-based definition of the norm that includes information about the system's energetics; in Sect. 7 we sum up the results of this work and discuss its future perspectives.
The construction of a CG model for a macromolecular system starts with the selection of a mapping M, that is, the projection operator connecting a microscopic, detailed configuration r_i, i = 1, ..., n to a low-resolution one R_I, I = 1, ..., N < n, where n and N are the number of atoms in the system and the number of effective interaction sites employed in its CG simplified picture, respectively: R_I = Σ_{i=1}^{n} c_{Ii} r_i, I = 1, ..., N. (Eq. 1) In Eq. 1, the weights c_{Ii} are positive, spatially homogeneous, i.e. independent of the configuration r, and subject to the normalization condition Σ_{i=1}^{n} c_{Ii} = 1 to preserve translational invariance [4]. While a particular choice of these coefficients corresponds to a specific CG representation of the system, by varying them, along with changing the degree of CG'ing N, one spans the mapping space M, whose elements are all the possible low-resolution descriptions that can be assigned to a macromolecule. In the perspective of quantitatively characterising the properties of such a space, the cardinality of M in the continuous definition presented in Eq. 1 makes its thorough exploration, although appealing, hard to handle in practice. In this work we thus restrict our analysis to the discrete subspace of CG representations that can be obtained for a system through a decimation [33, 34] of its microscopic degrees of freedom: a subset of N constituent atoms is retained while the remaining ones are neglected. By selectively discarding a subset of the system's atoms, this structural simplification procedure can be applied to systems of arbitrary size or environment, e.g. molecules in solution, embedded in lipid membranes, or in the gas phase; the set of potentially retained atoms can be extended to the solvent as well, so as to expand the definition of the system under examination to include e.g. water molecules that form hydrogen bonds with the solute.
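As a concrete illustration, the linear map of Eq. 1 can be sketched in a few lines of code. This is a toy example with invented coordinates; the function name and the specific weight matrix (centres of mass of atom pairs) are ours, not part of the original work:

```python
import numpy as np

def apply_mapping(c, r):
    """Project n atomistic coordinates r (n x 3) onto N CG sites
    via R_I = sum_i c_Ii r_i (Eq. 1). Each row of c must sum to 1."""
    assert np.allclose(c.sum(axis=1), 1.0), "weights must be normalised"
    return c @ r

# toy system: 4 atoms mapped onto 2 CG sites (centres of mass of pairs)
r = np.array([[0., 0., 0.], [1., 0., 0.], [4., 0., 0.], [5., 0., 0.]])
c = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5]])
R = apply_mapping(c, r)  # CG sites sit at the two pair midpoints
```

A decimation mapping is the special case in which each row of c has a single non-zero entry equal to 1, placed on the retained atom.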
Clearly, the filtering brought forward by the decimation mapping can be applied to a single configuration as well as to an ensemble of conformations sampled, e.g., in a molecular dynamics simulation. The outcome would consist, in the first case, in a single subset of coordinates, which thus preserves no information about the system other than the positions of the retained atoms; in the second case, the filtering would result in an ensemble of reduced representations whose distribution in configuration space reverberates the intra- and intermolecular interactions of the underlying high-resolution sample. In this work, we will focus on a particular type of molecular structure, namely a protein, and restrict our analysis to decimation mappings in which retained atoms are only selected among those composing the molecule. Initially, calculations will be performed only considering the static, crystallographic configuration of the molecule; subsequently, its energetics will be explicitly accounted for in Sect. 6, where we provide some preliminary results on the extension of the theory to an ensemble of conformations generated through a molecular dynamics simulation of the protein in explicit solvent. Also in this latter case, the environment is not explicitly retained by the CG mapping (again, only protein degrees of freedom are considered), but its effects are implicitly encoded in the distributions of conformations sampled by the molecule. Let us thus consider a protein composed of n constituent atoms; the number of representations Ω_N that can be constructed by selecting N of them as effective CG sites is Ω_N = n! / [N! (n − N)!], (Eq. 2) so that the total number of possible decimation mappings Ω reads Ω = Σ_{N=1}^{n} Ω_N = 2^n − 1, (Eq. 3) which becomes prohibitively large as the size of the system increases. Consequently, in the following we only consider the heavy atoms of the molecule as candidate CG sites, indicating with M the subspace of mappings obtained according to these prescriptions.
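The counting above is elementary to reproduce with exact integer arithmetic; a minimal check on a toy molecule (the value of n here is invented for illustration):

```python
from math import comb

n = 10  # a toy molecule with n candidate CG sites
omega_N = {N: comb(n, N) for N in range(1, n + 1)}  # Eq. 2 for each N
omega = sum(omega_N.values())                       # Eq. 3

# retaining at least one atom, the non-empty subsets of an n-atom set
# number 2^n - 1
assert omega == 2 ** n - 1
```

The binomial coefficient peaks at N = n/2, which is why intermediate resolutions dominate the total count.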
The investigation of the topological structure of M calls for the introduction of a distance D(M, M′), M, M′ ∈ M, able to quantify the "separation" between pairs of points M and M′ belonging to the space of decimation mappings, that is, pairs of CG representations of the system that differ in the choice of the retained atoms. Such a distance must be equipped with all the associated metric properties, namely identity, symmetry, and the triangle inequality. To construct D(M, M′), we consider a static configuration of the molecule, namely the crystallographic one, with (heavy) atoms located in positions r_i, i = 1, ..., n, and a set of selection operators χ_{M,i} ∈ {0, 1}, i = 1, ..., n defining mapping M, with Σ_{i=1}^{n} χ_{M,i} = N(M), where N(M) is the number of retained atoms in the mapping. Taking inspiration from the Smooth Overlap of Atomic Positions (SOAP) method developed by Csányi et al. [35, 36], we associate with each M ∈ M an element φ_M(r) of the Hilbert space of square-integrable real functions L²(R³), obtained by centering a three-dimensional Gaussian, whose normalization factor C will be fixed in the following, on the position of each atom of the macromolecule retained in the mapping: φ_M(r) = C Σ_{i=1}^{n} χ_{M,i} exp[−(r − r_i)²/(2σ²)]. (Eq. 6) The natural inner product of L²(R³), ⟨φ_M, φ_{M′}⟩ = ∫ dr φ_M(r) φ_{M′}(r), (Eq. 7) induces a norm ||φ_M|| for mapping M, with ||φ_M||² = ⟨φ_M, φ_M⟩, (Eq. 8) starting from which the distance D(M, M′) can be defined as D(M, M′) = ||φ_M − φ_{M′}||. (Eq. 9) Expanding the Gaussians, the inner product in Eq. 7 can be written as ⟨φ_M, φ_{M′}⟩ = Σ_{i,j} χ_{M,i} χ_{M′,j} J_{ij}, (Eq. 10) while the associated distance D(M, M′) in Eq. 9 reads D(M, M′) = [⟨φ_M, φ_M⟩ + ⟨φ_{M′}, φ_{M′}⟩ − 2⟨φ_M, φ_{M′}⟩]^{1/2}. (Eq. 11) In Eqs. 10 and 11, the coupling constant J_{ij} = J_{ij}(r_i, r_j) between two atoms i and j is given by J_{ij} = C² ∫ dr exp[−(r − r_i)²/(2σ²)] exp[−(r − r_j)²/(2σ²)], (Eq. 12) with J_{ij} = J_{ij}(|r_i − r_j|) due to translational and rotational invariance. By introducing polar coordinates in Eq. 12, a chain of Gaussian integrals provides J_{ij} = C² (πσ²)^{3/2} exp[−(r_i − r_j)²/(4σ²)] = exp[−(r_i − r_j)²/(4σ²)]. (Eq. 15) (1) In contrast to the original definition of the SOAP measure, which enables one to quantify the similarity between two molecular structures [35, 36], we here aim at determining the overlap between different CG representations of a single compound. As such, with respect to SOAP: (i) in Eq.
6 we do not employ local densities representing the chemical environment of a specific atom (which would afterwards require, e.g., averaging over all pairs of atoms for the calculation of the total similarity kernel [36]), but rather global ones associated with the molecule as a whole; and (ii) in Eq. 7 we do not introduce an additional integral over rotations of one of the two structures. Indeed, there is no ambiguity in defining the alignment of different CG representations, as this is dictated by the original, full-atom reference. The last equality in Eq. 15 has been obtained by setting, without loss of generality, C = (πσ²)^{−3/4}, so that J_{ii} = 1. (Eq. 16) Finally, by combining Eqs. 10 and 15, the inner product reads ⟨φ_M, φ_{M′}⟩ = Σ_{i,j} χ_{M,i} χ_{M′,j} exp[−(r_i − r_j)²/(4σ²)], (Eq. 17) i.e. a sum of Gaussian factors over the positions of all pairs of atoms retained in the two mappings. Notably, the factorization with respect to the operators χ_M and χ_{M′} in Eqs. 10 and 17 enables the inner product (and therefore the distance D and the squared norm E) to be determined starting from a matrix J_{ij} that can be calculated a priori over the static structure of the molecule. One might ask what kind of information the previously defined quantities provide about the possible CG representations of a system. To answer this question, we first focus on the squared norm of a mapping, E(M) = ||φ_M||² = Σ_{i,j} χ_{M,i} χ_{M,j} J_{ij}, (Eq. 18) see Eqs. 8 and 17. For a given retained atom i, the sum over j in Eq. 18 approximately represents its CG coordination number Z_i(M), that is, the number of retained atoms in the mapping that are located within a sphere of radius √2 σ from i. By fixing the degree of coarse-graining N, E(M) scales as E(M) ≈ N Z̄(M), (Eq. 19) showing that the dependence of the norm on the specific selection of atoms is dictated by Z̄(M), the average CG coordination number.
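Since J_ij depends only on the static structure, the whole construction reduces to quadratic forms over the selection vectors χ. A minimal sketch (function names and toy coordinates are ours), which can also be used to verify the metric properties of D numerically:

```python
import numpy as np

def coupling_matrix(pos, sigma):
    """Gaussian couplings J_ij = exp(-|r_i - r_j|^2 / (4 sigma^2)),
    with the normalization C chosen so that J_ii = 1 (cf. Eqs. 12-16)."""
    d2 = np.sum((pos[:, None, :] - pos[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (4.0 * sigma ** 2))

def inner(chi_a, chi_b, J):
    """<phi_M, phi_M'> = sum_ij chi_i chi'_j J_ij (Eqs. 10 and 17)."""
    return chi_a @ J @ chi_b

def distance(chi_a, chi_b, J):
    """D(M, M')^2 = E(M) + E(M') - 2 <phi_M, phi_M'> (Eq. 11)."""
    d2 = inner(chi_a, chi_a, J) + inner(chi_b, chi_b, J) \
        - 2.0 * inner(chi_a, chi_b, J)
    return np.sqrt(max(d2, 0.0))  # guard against float round-off

# three random decimation mappings of a 20-atom toy structure
rng = np.random.default_rng(0)
pos = rng.random((20, 3)) * 10.0
J = coupling_matrix(pos, sigma=1.9)
a, b, c = (rng.permutation([1.0] * 8 + [0.0] * 12) for _ in range(3))
```

Because D is the norm of a difference of Hilbert-space vectors, identity, symmetry and the triangle inequality hold by construction; the sketch lets one check them on concrete mappings.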
Let us now consider two limiting cases: (i) extremely sparse and homogeneous CG representations, in which each retained atom does not have any retained neighbour within a radius of order √2 σ; this condition can only be fulfilled provided that N is not too large (vide infra) or σ is much smaller than the typical interatomic distance. In this case, one has Z̄(M) ≈ 1 and consequently E(M) ≈ N; (ii) globular mappings characterised by densely populated (i.e. almost atomistic) regions of retained sites surrounded by "empty" ones. In this case, the average coordination number Z̄(M) will roughly resemble its atomistic counterpart z̄, the latter being defined as z̄ = (1/n) Σ_{i=1}^{n} Σ_{j=1}^{n} J_{ij}, (Eq. 22) and thus E(M) ≈ N z̄. It follows that the squared norm E(M) captures the average homogeneity of a CG representation, that is, whether the associated retained atoms are uniformly distributed across the macromolecule or are mainly localized in well-defined regions of it. In Fig. 1, we report examples of CG mappings extracted for these two extreme categories in the case of adenylate kinase (see Sect. 3 for further details on this protein), together with a CG representation in which the retained atoms are randomly selected. An analogous discussion can be carried out for the inner product ⟨φ_M, φ_{M′}⟩ in Eq. 17, calculated between two mappings M and M′ that respectively retain N and N′ atoms of the system. For a given atom i in mapping M, the sum Z′_i(M, M′) = Σ_j χ_{M′,j} J_{ij} approximately counts the number of neighbours j in mapping M′ located within a sphere of radius √2 σ from i. The inner product thus scales as ⟨φ_M, φ_{M′}⟩ ≈ N Z̄′(M, M′), (Eq. 24) with Z̄′(M, M′) the corresponding average. If, furthermore, the two mappings show roughly the same "globularity", Z̄(M) ≈ Z̄(M′), their parallelism requires Z̄′(M, M′) ≈ Z̄(M): the average number of neighbours an atom of M has that belong to mapping M′ has to be equal to the average number of neighbours the atom has that belong to M itself. This means that the two mappings must place retained atoms across the macromolecule in a similar fashion.
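The two limiting regimes can be checked numerically on a toy linear chain (an invented 1D example, not the protein of Sect. 3): an evenly scattered selection yields E ≈ N, while a contiguous block of the same size yields E ≈ N z̄ > N.

```python
import numpy as np

# 100 "atoms" on a line with unit spacing; sigma comparable to the spacing
pos = np.zeros((100, 3))
pos[:, 0] = np.arange(100)
sigma = 1.0
d2 = np.sum((pos[:, None] - pos[None, :]) ** 2, axis=-1)
J = np.exp(-d2 / (4.0 * sigma ** 2))

N = 10
sparse = np.zeros(100)
sparse[::10] = 1.0        # retained atoms scattered every 10 positions
globular = np.zeros(100)
globular[:10] = 1.0       # the same number of atoms in one dense block

E_sparse = sparse @ J @ sparse        # ~ N: no retained neighbours
E_globular = globular @ J @ globular  # ~ N * z_bar: dense cluster
```

For the sparse selection the off-diagonal couplings are exponentially suppressed and only the diagonal (J_ii = 1) survives, whereas the dense block accumulates sizeable nearest-neighbour contributions.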
Examples of approximately parallel and orthogonal CG representations for adenylate kinase are presented in Fig. 1. It follows that while E(M) quantifies the average sparseness of a CG representation, ⟨φ_M, φ_{M′}⟩, or equivalently cos θ_{M,M′}, characterises the average degree of spatial similarity between two different decimations of the microscopic degrees of freedom of the system. The distance D(M, M′) in Eq. 11 combines these two notions to quantify how "far" a pair of CG representations is in the space of possible mappings M. Based on these observations, we implemented a slight modification to the inner product ⟨φ_M, φ_{M′}⟩, and hence to the squared norm E(M) and distance D(M, M′), with respect to the definition originally presented in Eq. 17, which, however, does not change its overall properties or interpretation. We have previously discussed how, in the limiting cases of extremely sparse and globular mappings, one respectively obtains E(M) ≈ N and E(M) ≈ N z̄, where z̄ is the atomistic coordination number in Eq. 22. As the number of CG sites N increases, however, it will be extremely hard for a retained site not to have any retained neighbour within a sphere of radius of order σ, so that the exact scaling of E(M) with the degree of CG'ing N in the case of sparse mappings will be hardly observed. We thus divide the inner product in Eq. 17 by the average atomistic coordination number, and define ⟨φ_M, φ_{M′}⟩_z̄ = ⟨φ_M, φ_{M′}⟩ / z̄. (Eq. 29) Consequently, one has E(M)_z̄ = E(M)/z̄, (Eq. 30) so that sparse and globular mappings are now characterised by E(M)_z̄ ≈ N/z̄ and E(M)_z̄ ≈ N, respectively. For notational convenience, in the following we will omit the subscript z̄ and refer to E(M)_z̄, ⟨φ_M, φ_{M′}⟩_z̄ and the associated distance simply as E(M), ⟨φ_M, φ_{M′}⟩ and D(M, M′). Starting from the definitions introduced in Sect. 2, we now proceed to perform a quantitative analysis of the high-dimensional space M of CG representations that can be constructed for a macromolecule through a decimation of its atomistic degrees of freedom.
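The normalisation by z̄ and the cosine similarity between two mappings can be sketched as follows (a toy structure of ours; the variable names are illustrative, and z̄ is computed here as the plain average of the row sums of J, in the spirit of Eq. 22):

```python
import numpy as np

rng = np.random.default_rng(3)
pos = rng.random((50, 3)) * 9.0
sigma = 1.0
d2 = np.sum((pos[:, None] - pos[None, :]) ** 2, axis=-1)
J = np.exp(-d2 / (4.0 * sigma ** 2))

z_bar = J.sum() / len(pos)   # average atomistic coordination number (Eq. 22)
Jn = J / z_bar               # normalised couplings entering Eqs. 29-30

chi_a = rng.permutation([1.0] * 12 + [0.0] * 38)
chi_b = rng.permutation([1.0] * 12 + [0.0] * 38)
E_a = chi_a @ Jn @ chi_a     # normalised squared norm (Eq. 30)
E_b = chi_b @ Jn @ chi_b
cos_ab = (chi_a @ Jn @ chi_b) / np.sqrt(E_a * E_b)
```

Since all entries of J are non-negative, the cosine lies between 0 (spatially complementary mappings) and 1 (identical or maximally overlapping ones).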
As a testbed system we consider adenylate kinase (AKE), a 214-residue-long phosphotransferase enzyme catalysing the interconversion between adenine nucleotides, namely adenosine diphosphate (ADP), adenosine monophosphate (AMP), and adenosine triphosphate (ATP) [37]. The structure of adenylate kinase can be divided into three main building blocks [38, 39], with the mobile LID and NMP domains exhibiting a conformational rearrangement around a hinge, the stable CORE domain, which results in an overall open ↔ closed transition of the enzyme [40, 41]. Our calculations require as input only the value of the σ parameter and a static configuration r_i, i = 1, ..., n of the system to determine the set of Gaussian couplings J_{ij} in Eq. 32. We here set σ = 1.9 Å (that is, half the separation between two consecutive α carbons), and rely on the open crystal conformation of adenylate kinase (PDB code 4AKE), excluding from the analysis all hydrogens composing the biomolecule, resulting in a total of 1656 heavy atoms. The investigation of the topological structure of the decimation mapping space of AKE calls for an extensive characterisation of the relational properties among its points, achievable by analysing the behaviour of the distance D(M, M′) over an ensemble of prototypical CG representations extracted from M. The discussion carried out in Sect. 2, however, highlighted that D(M, M′) strictly depends on two factors: the globularity of each mapping, encoded in the squared norm E(M), and their mutual spatial complementarity, that is, the inner product ⟨φ_M, φ_{M′}⟩ or equivalently the cosine cos θ_{M,M′}. It is then useful to first focus on these one- and two-"body" ingredients before combining them into the distance D(M, M′). As such, in Sects. 3.1, 3.2 and 4 we will respectively discuss the behaviour of E(M) and cos θ_{M,M′} across the mapping space of AKE; the analysis of the distance D, and hence of the topology of M, will be presented in Sect. 5.
Let us first consider the squared norm E(M) of a CG representation M defined in Eq. 30. As previously discussed, this quantity provides information about the spatial homogeneity of a mapping with a given degree of CG'ing N; that is to say, it recapitulates how the retained atoms are distributed across the molecular structure, from uniformly scattered (E(M) ≈ N/z̄) to mainly concentrated in well-defined, almost atomistic domains emerging out of a severely CG'ed background (E(M) ≈ N). It is important to stress that mappings belonging to the two aforementioned extreme cases are routinely employed by the CG'ing community in the description of a biomolecular system. In proteins, examples from the homogeneous class include physically intuitive, residue-based CG representations of the molecule in terms of its α carbons or backbone atoms [5, 8]; homogeneity, on the other hand, is often abruptly broken in chemically informed, multiscale mappings, in which a higher level of detail, up to the atomistic one, is sharply localized on the biologically/chemically relevant regions of the system, e.g. the active sites of the protein, while the remainder is treated at extremely low resolution [8]. Furthermore, moving away from these limiting cases, increasing attention is being paid to CG descriptions in which the level of detail is, although inhomogeneously, quasi-continuously modulated throughout the molecular structure [8]. Be they fully homogeneous, markedly inhomogeneous, or smoothly interpolating between these two classes, the CG representations that are usually adopted in the literature to simplify a biomolecule are often selected a priori by relying on general and intuitive criteria. Critically, such representations only constitute a few isolated instances extracted from the high-dimensional mapping space M of the system. One natural question follows: how representative are these "common" mappings of the diversity of the space M?
In other words, how spatially homogeneous are the possible CG descriptions that can be designed for a macromolecule when no prior knowledge about its chemical structure or biological function is exploited to guide the mapping construction? To answer this question, we start by introducing the number of mappings Ω_N(E) that attain a particular value E of the squared norm for a given number of CG sites N, which is given by Ω_N(E) = Σ_{M} δ_{N(M),N} δ(E − E(M)), (Eq. 33) where the sum over mappings is shorthand for a sum over all values of the selection operators, Σ_{M} O(M) = Σ_{χ_1=0,1} ··· Σ_{χ_n=0,1} O(χ_1, ..., χ_n), (Eq. 34) O being a generic observable that depends on the mapping through the operators χ_i. Normalizing Eq. 33 by the total number of mappings with N sites, Ω_N, we define the conditional probability of having a mapping with norm E given that the degree of coarse-graining is N, that is P_N(E) = Ω_N(E)/Ω_N, (Eq. 35) which satisfies the normalization condition Σ_E P_N(E) = 1 regardless of the number of retained sites. P_N(E) can be rewritten as P_N(E) = (1/Ω_N) Σ′ δ(E − E(M)), (Eq. 37) where the primed sum runs over all mappings with fixed resolution N, i.e. over all values of the set of operators {χ_i} satisfying Σ_{i=1}^{n} χ_i = N. (Eq. 38) By providing direct insight on the degree of spatial uniformity characterising the ensemble of all possible CG descriptions of a macromolecular system, P_N(E) represents a first important ingredient in the investigation of the structure of the mapping space M. We thus aimed at analysing the behaviour of the conditional probability P_N(E) across the decimation mapping space M of AKE for a set of 16 values of N ranging from N = 53 to 1605, see Table 1. However, even restricted to these cases, an exhaustive enumeration of all possible CG representations of the system is unfeasible in practice: for example, in the case of AKE (n = 1656), roughly 10^276 possible CG representations can be constructed that describe the enzyme in terms of a subset of N = 214 heavy atoms (one for each residue). This number grows to 10^496 for N = 856 (four heavy atoms per residue on average), that is, close to the maximum of the binomial coefficient, obtained for N = n/2, see Eq. 2.
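The combinatorial estimates above are easy to reproduce with exact integer arithmetic; the snippet below is our own check of the orders of magnitude quoted in the text (the precise exponent depends on rounding):

```python
from math import comb, log10

n = 1656  # heavy atoms of AKE
for N in (214, 856):
    print(f"N = {N}: about 10^{log10(comb(n, N)):.1f} mappings")
```

Python's `math.comb` works with arbitrary-precision integers, so the binomial coefficients are evaluated exactly before taking the base-10 logarithm.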
To overcome this combinatorial challenge, for each degree of CG'ing we generated Ω_tot = 2 × 10^6 uniformly distributed random mappings as strings χ_i, i = 1, ..., n of zeros and ones compatible with Eq. 38, and calculated the associated squared norm E. Results for each N were then binned along the E axis in intervals of δE = 0.1, and the corresponding P_N(E) was estimated as P_N(E) ≈ Ω̃_N(E)/(Ω_tot δE), (Eq. 39) where Ω̃_N(E) is the number of sampled mappings with squared norm falling between E and E + δE. Note that in this way we are approximately treating as continuous the intrinsically discrete, unevenly spaced spectrum of possible norms, and the density P_N(E), and consequently Ω_N(E), as piecewise constant. In this "continuous" limit, the normalization condition of P_N(E) becomes ∫ dE P_N(E) = 1. The set of distributions P_N(E) obtained from our random sampling of the mapping space of AKE is displayed in Fig. 2. We observe that, for each value of the CG resolution N, P_N(E) is unimodal and narrowly peaked around its average squared norm ⟨E⟩_N, the latter being an increasing function of N. On the other hand, the standard deviation σ_{E,N} is non-monotonic in the degree of CG'ing: starting from extremely small values in the case of few retained atoms (e.g. N = 53, 107 and 214), σ_{E,N} increases roughly up to N ≈ 3n/4 and then starts to decrease, reaching zero for N = n; in this case, only one possible mapping exists, namely the atomistic one. These features are further highlighted in Table 1 and Fig. 3, in which we report the dependence of ⟨E⟩_N and σ_{E,N} on the degree of CG'ing N as obtained from the distributions P_N(E) in Fig. 2. ⟨E⟩_N quantifies the average spatial homogeneity of the ensemble of CG representations that can be randomly assigned to AKE at a specific resolution. As previously discussed, maximally inhomogeneous mappings, in which a chiseled chunk of the biomolecule is treated atomistically while the remainder is almost neglected, are characterised by E ≈ N. Critically, Fig.
3 displays that such linear scaling always lies above the average ⟨E⟩_N for all degrees of coarse-graining investigated. The deviation between the two curves is non-monotonic, with a maximum obtained for N = n/2, and only vanishes for N → n, where mappings become very dense as they collapse towards the atomistic representation. As a consequence, the CG representations one encounters by randomly probing the mapping space M tend to be "sparse" rather than compact. Furthermore, the difference between the squared norm of the globular case and ⟨E⟩_N is always (but for N ≈ n) one or two orders of magnitude larger than the standard deviation of the corresponding P_N(E), see Fig. 3. (Fig. 3, main plot: N-dependence of the average squared norm ⟨E⟩_N ("Random", black line) and associated 3σ_{E,N} confidence interval (khaki area) as obtained from a random sampling of the mapping space of AKE, superimposed on the region covered by the set of single-window, preliminary WL runs (purple area). The minimum ("WL-min", blue line) and maximum ("WL-max", red line) squared norms reached by the preliminary runs are highlighted; "WL-max" also corresponds to the scaling E ≈ N obtained in the case of inhomogeneous, globular mappings.) It follows that inhomogeneous mappings lie extremely far away in the right tails of the distributions displayed in Fig. 2, thus constituting an exponentially vanishing subset of the space M. The suppression of the statistical weight associated with high-norm, globular CG representations of AKE in the space of all possible ones is not surprising, and is solely driven by entropic effects. Indeed, at least for small and intermediate N, it is extremely unlikely that a completely random selection of retained atoms across the biomolecule will result in their dense confinement within sharply defined spatial domains of the system, just as it is unlikely for a gas to occupy only a small fraction of the volume in which it is enclosed.
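The random-sampling estimate of Eq. 39 amounts to a normalised histogram of squared norms over uniformly drawn N-subsets; a scaled-down sketch (toy structure and sample sizes are ours, far smaller than the 2 × 10^6 mappings used in the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, sigma, n_samples = 60, 15, 1.0, 2000
pos = rng.random((n, 3)) * 8.0
d2 = np.sum((pos[:, None] - pos[None, :]) ** 2, axis=-1)
J = np.exp(-d2 / (4.0 * sigma ** 2))

norms = np.empty(n_samples)
for s in range(n_samples):
    chi = np.zeros(n)
    chi[rng.choice(n, size=N, replace=False)] = 1.0  # uniform random mapping
    norms[s] = chi @ J @ chi

# density=True plays the role of the 1/(Omega_tot * deltaE) normalisation
hist, edges = np.histogram(norms, bins=40, density=True)
```

Even in this toy setting the histogram comes out narrowly peaked: uniformly drawn subsets concentrate around a typical, sparse-like value of the squared norm.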
Interestingly, this latter analogy can be pushed further by noting that the squared norm E(M), see Eqs. 30 and 18, is akin to the negative configurational energy of a lattice gas living on the irregular grid defined by the protein's conformation, whose particles interact via a hard-core, short-range potential followed by an attractive Gaussian tail. In this context, the selection operators χ_{M,i} = 0, 1, i = 1, ..., n of a mapping M with N retained atoms can be interpreted as the set of occupation numbers describing a distribution of the N particles of the gas on the n available lattice sites. It follows that compact CG representations of AKE, located in the large-E limit of P_N(E), are just as challenging to randomly sample within the space M as are the low-energy configurations of the gas in which the N particles spontaneously occupy only a fraction of the available volume. The implications of this analogy will be thoroughly explored in Sect. 4. The strongly entropy-driven distribution of mappings calls for the introduction of enhanced sampling techniques to boost the exploration of the mapping space; in this work, we resort to the algorithm proposed by Wang and Landau (WL) [28, 29, 42, 43]. For each CG resolution N, the aim is to obtain a uniform sampling of the possible mapping norms E across the space M, in contrast to the set of narrowly peaked probability distributions displayed in Fig. 2. In principle, this is attained by setting up a Markov chain Monte Carlo simulation in which a transition between two subsequent mappings M and M′, both retaining N atoms, is accepted with probability α given by [28] α_{M→M′} = min[1, Ω_N(E(M))/Ω_N(E(M′))] = min{1, exp[S_N(E(M)) − S_N(E(M′))]}, (Eq. 43) where Ω_N(E) is the density of states defined in Eq. 35 and S_N(E) = ln Ω_N(E) is the corresponding microcanonical entropy. When compounded with a symmetric proposal probability π for the attempted move, π_{M→M′} = π_{M′→M}, the Markov chain in Eq.
43 would generate, after an initial relaxation transient, mappings distributed with probability proportional to 1/Ω_N(E) [28], resulting in a flat histogram P_N(E) of visited norms over the whole range of possible ones [43]. In practice, however, the density of states in Eq. 43 is not known a priori. The power of the WL approach resides in its ability to self-consistently obtain Ω_N(E) through a sequence k = 1, ..., K of non-equilibrium simulations in which increasingly accurate approximations Ω̃^k_N(E) to the exact result are generated, iterations being stopped when the desired precision is achieved [28,29]. For the sake of brevity, we here omit an exhaustive discussion of the general algorithmic workflow behind WL sampling as well as an in-depth description of the specific implementation employed in this work; these details are provided in Appendix A. In the WL reconstruction of a density of states such as Ω_N(E), knowledge of the sampling boundaries proves extremely beneficial to the accuracy and rate of convergence of the self-consistent scheme [44]. For each degree of CG'ing investigated, we, thus, initially performed a preliminary, non-iterative WL run to approximately locate the minimum and maximum mapping norms E_min(N) and E_max(N) achievable for AKE at that specific CG resolution, and consequently bound the support of the corresponding Ω_N(E). The results for E_min(N) and E_max(N) obtained from this analysis are presented in Fig. 3 and Table 2 of Appendix A. We observe that the mapping norms visited by the set of preliminary WL runs extend, for all values of N, over a significantly wider range compared to the one obtained by random sampling. Remarkably, the maximum norm E_max(N) exhibits a linear dependence on N that is fully compatible with the one associated with globular CG representations, E_max(N) ≈ N, highlighting that the WL approach succeeds in exploring this entropically suppressed region of the mapping space. Furthermore, Fig.
3 displays that the minimum norm E_min(N) identified by the preliminary runs always lies below the average ⟨E⟩_N for all values of N. In contrast to globular mappings, CG representations living in this low-E limit are maximally homogeneous, that is, retained atoms are scattered throughout the molecular structure as uniformly as possible. This class constitutes another exponentially vanishing subset of the mapping space: in the gas picture, it would correspond to the ensemble of configurations in which gas particles are regularly distributed within the available volume. Having approximately identified the range of mapping norms achievable for AKE at each CG resolution, we subsequently moved to the determination of the associated densities of states Ω_N(E) via the iterative WL scheme, see Appendix A for all technical details. Calculations were only performed for a subset of degrees of CG'ing, namely those in which the number of retained atoms N is an integer multiple of the number of residues composing the biomolecule. To speed up convergence of the algorithm, for each N we slightly reduced the range of norms [E_min, E_max] with respect to the one predicted by the explorative WL runs, see Table 2 in Appendix A. This interval was then divided into a set of overlapping windows in which independent WL simulations were performed [29]. The resulting partial densities of states were a posteriori combined to determine the cumulative Ω_N(E) up to a global multiplicative factor, or, in our case, the entropy S_N(E) = ln[Ω_N(E)] up to an additive constant. WL estimates of the entropy S_N(E) are presented in Fig. 4 for N = 856, while results for all the other degrees of CG'ing are reported in Fig. 12 of Appendix A.
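The WL reconstruction of Ω_N(E) can be illustrated on a minimal system whose exact density of states is known by enumeration. The sketch below applies the standard WL scheme (modification-factor halving, flat-histogram check) to a ring lattice gas whose energy counts adjacent occupied pairs; all sizes and thresholds are illustrative choices, not those of Appendix A.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, N = 12, 4   # toy ring lattice gas: n sites, N particles

def energy(occ):
    """Attractive contacts: number of adjacent occupied pairs on the ring."""
    return int(sum(occ[i] and occ[(i + 1) % n] for i in range(n)))

# Exact density of states by brute-force enumeration (feasible at this toy size).
exact = np.zeros(4)
for sites in itertools.combinations(range(n), N):
    occ = np.zeros(n, dtype=bool)
    occ[list(sites)] = True
    exact[energy(occ)] += 1

# Wang-Landau: refine S(E) = ln Omega(E) until the visit histogram is flat.
S = np.zeros(4)
occ = np.zeros(n, dtype=bool)
occ[:N] = True
E = energy(occ)
lnf = 1.0
while lnf > 1e-4:
    H = np.zeros(4)
    flat = False
    while not flat:
        for _ in range(1000):
            i = rng.choice(np.flatnonzero(occ))    # move one particle...
            j = rng.choice(np.flatnonzero(~occ))   # ...to a randomly chosen empty site
            occ[i], occ[j] = False, True
            E_new = energy(occ)
            if S[E] >= S[E_new] or rng.random() < np.exp(S[E] - S[E_new]):
                E = E_new
            else:
                occ[i], occ[j] = True, False       # reject: restore configuration
            S[E] += lnf
            H[E] += 1
        flat = H.min() > 0.8 * H.mean()            # flat-histogram criterion
    lnf /= 2.0                                     # halve the modification factor

wl = np.exp(S - S.max())
wl *= exact.sum() / wl.sum()
print(np.round(wl, 1), exact)   # WL estimate vs exact Omega(E)
```

The estimated Ω(E) reproduces the exact counts (105, 252, 126, 12) up to small statistical errors, including the entropically suppressed fully contiguous configurations, which is precisely the regime random sampling misses.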
In all cases, we observe that the behaviour of S_N is non-monotonic in E, exhibiting a unique maximum as the mapping norm moves from the left to the right boundary of the investigated range-that is, in transitioning from extremely homogeneous to maximally globular CG representations. As Ω_N(E) = exp[S_N(E)], this result confirms that these two limiting classes of mappings constitute regions of exponentially vanishing size within the broad space M. At the same time, the overall shape of S_N strongly depends on the degree of CG'ing: while for high N entropy profiles are nearly symmetric around their maximum, they become increasingly skewed as fewer and fewer atoms are employed to represent the macromolecule, see Fig. 12. This asymmetry becomes apparent by performing, for each CG resolution, a quadratic expansion of S_N around its maximum,

S_N(E) ≈ S_N(Ẽ(N)) + (1/2) S″_N(Ẽ(N)) [E − Ẽ(N)]², (44)

where Ẽ(N) is the norm at which the first derivative S′_N of the entropy vanishes, and S″_N(Ẽ(N)) is the corresponding second derivative-the dependence of S′_N and S″_N on E being displayed in Fig. 4 for N = 856. The accuracy of this parabolic, symmetric approximation in reproducing the exact S_N over the whole E-range increases with the number of retained atoms, see Figs. 4 and 12, especially as far as the limit of high mapping norms is concerned. Finally, it is interesting to test the predictions of WL sampling against the results obtained via a completely random exploration of the mapping space. To this end, Figs. 4 and 12 include a comparison between the WL entropies S_N and their random counterparts S^ran_N, the latter defined as S^ran_N(E) = ln[P_N(E)] + C_N, where P_N(E) are the probability densities presented in Fig. 2 and the constants C_N are set so that the maxima of S^ran_N and S_N coincide. For each value of N the two profiles are in perfect agreement, thus confirming the accuracy of the self-consistent WL scheme in determining the density of states of a system.
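The quadratic expansion above can be checked on any density of states whose moments are known exactly. In the sketch below, binomial counts stand in for Ω_N(E): the curvature of the log-density at its maximum reproduces the exact standard deviation via the width |S″|^{−1/2}.

```python
import math

n = 1000
# Known density of states: Omega(E) = C(n, E), a stand-in for the WL-estimated one.
S = [math.lgamma(n + 1) - math.lgamma(E + 1) - math.lgamma(n - E + 1)
     for E in range(n + 1)]

E_tilde = max(range(n + 1), key=lambda E: S[E])            # maximum of the entropy
S2 = S[E_tilde + 1] - 2.0 * S[E_tilde] + S[E_tilde - 1]    # discrete second derivative

sigma_pred = abs(S2) ** -0.5         # width predicted by the curvature of S at its maximum
sigma_exact = 0.5 * math.sqrt(n)     # exact binomial standard deviation
print(E_tilde, sigma_pred, sigma_exact)
```

For this density of states, the curvature-based prediction (≈ 15.82) agrees with the exact value (≈ 15.81) to a fraction of a percent, confirming that the parabolic approximation captures the bulk of the distribution.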
Critically, results for S^ran_N only extend over a very narrow range of mapping norms, centred around the value Ẽ(N) at which the maximum of the entropy is attained. It is, therefore, largely unfeasible, by randomly drawing CG representations, to exhaustively explore the mapping space M of a macromolecule. In this respect, it is worth inspecting the position, on the E axis, of the C_α and backbone mappings (which in AKE retain N = 214 and N = 856 sites, respectively), two reduced representations that are routinely employed for CG'ing proteins [5,8]. These turn out to be located in the vicinity of the class of "prototypical" random ones, for which the entropy S_N reaches its maximum; however, their intrinsic regularity, dictated by the position of the retained sites on the peptide chain, makes these mappings slightly more homogeneous than the random ones, see Figs. 4 and 12. To provide a more quantitative measure of the consistency between random and WL sampling results, for each degree of CG'ing we recalculated the average and variance of the mapping norm, see Eqs. 41 and 42, starting from the WL entropies S_N. These are used to compute P_N(E) making use of a saddle-point approximation of Eq. 35, namely

P_N(E) ≈ √(|S″_N(Ẽ(N))|/2π) exp{−|S″_N(Ẽ(N))| [E − Ẽ(N)]²/2}, (45)

where we made use of the quadratic expansion of S_N defined in Eq. 44. Within the saddle-point approximation, one has ⟨E⟩_N = Ẽ(N), Ẽ(N) being the position of the maximum of S_N, and σ_{E,N} = |S″_N(Ẽ(N))|^{−1/2}: these predictions are found to be in perfect agreement with their random sampling counterparts, results being presented in Table 1. We now proceed to the description of the mapping space M from the perspective of the inner product between its elements. Following the same scheme of Sect.
3.1, we here focus on the cosine between mappings that are constrained to share the same resolution N, and introduce the probability P_{NN}(cos θ) of observing a value of cos θ provided that this constraint is satisfied: that is, the ratio between the number of mapping pairs whose cosine is equal to cos θ, Ω²_{NN}(cos θ), and the total number of possible pairs, Ω²_N. We can now investigate how the average degree of parallelism between two mappings changes when considering randomly selected mappings or more peculiar elements of M. In this section, we compare two data sets, each containing 10⁶ elements: the first was obtained by computing the cosine between two mappings in which the retained sites were picked randomly; the second data set was instead constructed in a more sophisticated manner, making use of the WL sampling scheme to collect mappings that uniformly span the range [E_min, E_max] of accessible values of E identified in the previous section. More specifically, we started a WL exploration as in Sect. 3.1 over this range and, when all the reference bins had been visited at least once, we began saving a mapping every 1656 Monte Carlo moves. Mappings were saved in different macro-bins, each covering an interval of amplitude 20 (in units of E). Sampling ended when 5000 mappings were saved in each box, without considering the convergence of the WL algorithm. The data set was then generated by computing the cosine (Eq. 26) between randomly selected pairs of mappings extracted through this procedure. Importantly, the WL sampling scheme produces a pool of potentially correlated mappings, so that the chance of collecting similar elements of M cannot be excluded. The normalised histograms of cos θ values obtained from the two data sets are displayed in Fig. 5a for N = 856.
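The cosine computation can be sketched as follows; the scalar product between two mappings is assumed here to be the sum of couplings across the two retained sets (with unit self-couplings, so the product is positive definite), a stand-in for the exact Eq. 26. Confining two mappings to opposite ends of a structure drives their cosine towards zero.

```python
import numpy as np

rng = np.random.default_rng(2)

n, sigma = 200, 1.0
coords = rng.uniform(0.0, 10.0, size=(n, 3))
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
J = np.exp(-d2 / (2.0 * sigma**2))   # assumed Gaussian couplings; J_ii = 1

def dot(A, B):
    """Assumed scalar product: sum of couplings across the two retained sets."""
    return J[np.ix_(np.asarray(A), np.asarray(B))].sum()

def cos_theta(A, B):
    return dot(A, B) / np.sqrt(dot(A, A) * dot(B, B))

N = 30
A = rng.choice(n, size=N, replace=False)
B = rng.choice(n, size=N, replace=False)
left = np.argsort(coords[:, 0])[:N]      # mappings confined to opposite ends of the box
right = np.argsort(coords[:, 0])[-N:]

print(cos_theta(A, A), cos_theta(A, B), cos_theta(left, right))
```

A mapping is perfectly parallel to itself (cos θ = 1), two random mappings retain a moderate overlap, and the two spatially disjoint mappings are nearly orthogonal, mirroring the behaviour seen in Fig. 5.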
We observe that while the random cosine distribution displays a narrow peak around its average, the WL histogram is more smeared, reflecting the increased diversity of the data set. Indeed, the latter histogram spans values that range from ≈ 1, obtained when two mappings are perfectly parallel, to 0.457, when two mappings are as orthogonal as possible given the properties of the lattice and the selected number of retained sites. In Fig. 5a, we also report a graphical rendering of the two maximally orthogonal mappings, which possess a high value of E (E = 847.32 and E = 843.82, respectively) and cover different regions of the enzyme's structure. In Fig. 5b, we extend these considerations to different values of N, namely those employed in Sect. 3.1. The random distribution is always confined to a narrow interval of values of cos θ, while WL data sets are capable of spanning a much wider range. In particular, for sufficiently small values of N, it is possible to retrieve maximally parallel (cos θ = 1) and maximally orthogonal (cos θ = 0) mappings inside the WL data set. Reaching orthogonality is made possible by the fact that, at such low values of N, it is possible to confine retained sites in two separate regions of the protein structure. As anticipated in Sect. 3, the reduced representation discussed in the present work, in which a mapping is defined in terms of a decimation of the atoms available on the molecular structure, suggests the analogy with a lattice gas. Also in this case, in fact, we have a number n of nodes that can be occupied by N ≤ n sites, each node being accessible to a single site at a time, thus implementing a hard-core repulsion. This analogy is a classic of statistical mechanics, and enables one, e.g. to map an Ising model onto a gas of interacting particles, thus making it manifest that the spontaneous magnetisation in the former and the liquid-gas phase transition in the latter belong to the same universality class [45].
Here, we investigate the consequences of the lattice gas interpretation of reduced representations in order to tackle the issue of characterising the mapping space from a different perspective. Specifically, we borrow concepts from equilibrium statistical mechanics to show that sharp transitions can occur that separate one or more phases corresponding to classes of reduced representations endowed with markedly distinct structural properties. While the previously performed analysis of the smooth and continuous densities of states Ω_N(E) already suggested the existence of such classes, see Sect. 3, for particular numbers of retained sites these are shown to be as distinct as two or more phases of a fluid can be when observed through the perspective of this statistical mechanical analogue. The role of the energy can be played by the norm of the mapping: in analogy with a lattice gas, we expect that if two retained sites are close to each other, they feel an attractive interaction, thereby reducing the energy. We thus define the energy of the system as

H(M) = −E(M). (47)

In the previous sections, we obtained the density of states in terms of the mapping norm, Ω_N = Ω_N(E). Making use of Eq. 47 we can, thus, write

Ω_N(H) = Ω_N(−E). (48)

Let us now consider a system governed by the lattice Hamiltonian in Eq. 47 at equilibrium with a reservoir at temperature T = β⁻¹. The partition function of such a system can be expressed in terms of Ω_N(E) via

Z_N(β) = ∫ dE Ω_N(E) e^{−βE} = ∫ dE e^{S_N(E) − βE}, (49)

where E now denotes the lattice-gas energy of Eq. 47, and where we used the relation S_N(E) = ln Ω_N(E) to define the entropy. Equation 49 enables us to compute the dimensionless Helmholtz free energy as

βF_N(β) = −ln Z_N(β). (50)

While the logarithm of the integral can be theoretically and numerically cumbersome to compute, it is possible to obtain a reasonable estimate of βF_N through a saddle-point approximation. Specifically, we can expect the integral to be approximately equal to its largest integrand, so that

βF_N(β) ≈ min_E [βE − S_N(E)] + C, (51)

where C is an immaterial constant. This approximation provides us with a definition of the free energy that is equivalent to the Legendre-Fenchel transform:

βF_N(β) = min_E [βE − S_N(E)]. (52)

Fig. 6 Heat capacity C_V (red circles, right ordinate) and value of the energy E⋆ corresponding to the minimum of the free energy (blue triangles, left ordinate) as functions of the inverse temperature β for the system with N = 214. E⋆ decreases monotonically with β, indicating that higher temperatures correspond to higher values of the average internal energy of the lattice gas, as expected; however, a jump discontinuity in E⋆ appears in correspondence with the same value β_gl for which the heat capacity features a sharp peak, suggesting the occurrence of a first-order phase transition that separates two distinct phases: a gas (low β) from a liquid (high β) for the lattice gas model, and, correspondingly, a sparse phase from a dense, localised phase in the case of mappings

The thermodynamics of the lattice gas at thermal equilibrium can thus be retrieved by computing Eq. 52 for a given value of N at all values of β. It is particularly instructive to investigate the temperature dependence of E⋆, defined as the value of the energy for which βE − S_N(E) reaches its minimum. In Fig. 6 (blue curve, left ordinate), we report this function for N = 214: it is possible to observe that E⋆ = E⋆(β) decreases monotonically, i.e. the lower the temperature, the lower the value of the energy, which in turn corresponds to higher values of the mapping norm. At a particular value β_gl of the inverse temperature, however, E⋆ drops abruptly: in this context, such behaviour is suggestive of a first-order, discontinuous phase transition. To gain further insight, we computed the shapes of βE − S_N(E) for values of β below and above β_gl. These functions, reported in Fig. 7, indeed show two minima separated by a relatively low barrier; increasing β, the absolute minimum shifts from the right to the left, crossing a point for which the two are essentially degenerate.
This is the point of coexistence of two distinct "phases" of our lattice gas: a low-density one corresponding to distributed mappings (high energy), and one ascribable to denser, compact conglomerates of sites (low energy). The discontinuous nature of the transition from one regime to the other is confirmed by the inspection of the heat capacity, computed as

C_V = ∂⟨E⟩/∂T = −β² ∂⟨E⟩/∂β, (53)

and reported in Fig. 6 (red curve, right ordinate). The sharp, asymmetric peak in C_V, located at the value β_gl of the inverse temperature, shows that the lattice gas crosses a phase transition between a gas and a liquid phase.

Fig. 7 Helmholtz free energy βF of the lattice gas as a function of the energy for different values of the inverse temperature β. For low values of β the curves have a unique and absolute minimum; however, as β increases, a metastable minimum appears that, for a particular value of the inverse temperature, becomes degenerate with the previous one. The presence of a small but appreciable barrier between the two minima makes the position of the absolute minimum, E⋆, shift abruptly from one to the other, as can be seen in Fig. 6, thus making E⋆(β) discontinuous

A crucial role in this behaviour is played by the number of coarse-grained sites. In fact, as N increases, the system acquires the possibility of crossing a second phase transition: for example, in the case of N = 1070, besides the gas-liquid one, it is possible to observe a second, even sharper discontinuity in E⋆ at a value of the inverse temperature β_ls > β_gl. This temperature separates the liquid from the solid phase: when the lattice gas particles are sufficiently many, and the temperature sufficiently low, the system can "freeze" into particularly dense mappings with very low entropy. Also in this case, the inspection of the heat capacity (Fig. 13 in Appendix B) supports the interpretation of this as a phase transition. Finally, if the number of sites is too large (e.g. N = 1498), no transition is observed, see Fig. 13.
The observations reported in this section resonate with those made by Foley and collaborators in a recent work [15]: there, they observed a phase transition in a system whose degrees of freedom were the retained sites of a reduced model of proteins. In that case, the energy of a given mapping was obtained from the calculation of the spectral quality of the associated model, a quantity related to the sum of the eigenvalues of the covariance matrix obtained by integrating exactly a Gaussian network model (GNM). While apparently very distinct, the spectral quality and the norm of the mapping might bear substantial similarities: in fact, the former entails information about a very simple model, whose mechanical and thermodynamical properties are completely determined by the contact matrix of the underlying protein structure. It is, thus, reasonable to guess that the mapping norm provides, in an effective and efficient manner, information akin to that entailed in the spectral quality about the sparsity or localisation of the retained sites in a given mapping. If, and to what degree, these two quantities are related, and how intimately this relation depends on the Gaussian nature of the GNM, requires further investigation that will be the object of future studies. To conclude this section, we note that the observed phase transitions separate mappings so structurally diverse that they can be associated with qualitatively different phases. It is, thus, natural to wonder if and how these phases are organised in the metric space induced by the norm of the mapping, and what information the exploration of the latter can bring about the system it is applied to. To provide an answer to these questions, the next section is devoted to the topological characterisation of the mapping space. In the previous sections, we analysed the mapping space M in terms of the mapping norm E and of the cosine between its constituent elements. Here, we discuss the distance D (Eqs.
11, 31) between members of M with the aim of showing, once again, that a peculiar choice of retained CG sites, i.e. one impossible to obtain with random sampling, displays non-trivial statistical properties that are reflected in the topological organisation of the mapping space. Without loss of generality, we restrict our investigation to the case N = 214, namely the number of amino acids of adenylate kinase. We generated a data set of mappings following the protocol explained in Sect. 3.2; in this case, the range of values of E was narrower and only 10 macro-bins of amplitude 20 were explored. The data set was constructed by randomly selecting 100 elements from each of the macro-bins, resulting in 1000 CG mappings that homogeneously span the accessible values of E. The sketch map algorithm [46,47] was employed to embed these 1000 points from the high-dimensional space of mappings M into a two-dimensional plane, at the same time preserving as faithfully as possible the relative distances among them-that is to say, nearby points in the mapping space are mapped onto nearby points in the 2D space, see Fig. 8. The two critical parameters of the algorithm are σ_d and σ_D, which modulate how far and close points are in the low- and high-resolution space, respectively [46]. To give the reader a feeling for the impact that these parameters have on the structure of the low-dimensional representation, we report the embeddings obtained for a low (Fig. 8a) and high (Fig. 8b) value of σ_d and σ_D. In the first case, presented in Fig. 8a and referring to low values of the σ parameters, data points are in general very sparse and uniformly distributed on the plane, with the exception of a group of points that accumulate in a denser cluster: these are particularly compact mappings localised in a specific region of the molecule. Such mappings remain close to each other even when the σ parameters are increased, thus "squeezing" all points in the low-D embedding, see Fig. 8b.
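As a crude stand-in for sketch-map, classical multidimensional scaling already illustrates how a distance-preserving embedding separates families of mappings. In the sketch below, both the squared distance D²(A,B) = E(A) + E(B) − 2⟨A,B⟩ and the Gaussian couplings are assumed forms, not the exact Eqs. 11 and 31.

```python
import numpy as np

rng = np.random.default_rng(3)

n, sigma = 200, 1.0
coords = rng.uniform(0.0, 10.0, size=(n, 3))
d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
J = np.exp(-d2 / (2.0 * sigma**2))   # assumed Gaussian couplings, J_ii = 1

def dot(A, B):
    return J[np.ix_(np.asarray(A), np.asarray(B))].sum()

# Three families of mappings: confined to the left end, to the right end, and random.
N = 30
order = np.argsort(coords[:, 0])
pools = [order[:60], order[-60:], np.arange(n)]
maps = [rng.choice(p, size=N, replace=False) for p in pools for _ in range(15)]

# Assumed squared distance between mappings: D^2(A,B) = E(A) + E(B) - 2<A,B>.
m = len(maps)
D2 = np.array([[dot(a, a) + dot(b, b) - 2.0 * dot(a, b) for b in maps] for a in maps])

# Classical MDS: double-centre the squared distances and diagonalise.
C = np.eye(m) - np.ones((m, m)) / m
G = -0.5 * C @ D2 @ C
vals, vecs = np.linalg.eigh(G)
X = vecs[:, -2:] * np.sqrt(np.clip(vals[-2:], 0.0, None))   # 2D embedding

left, right = X[:15], X[15:30]
between = np.linalg.norm(left.mean(0) - right.mean(0))
within = np.linalg.norm(left - left.mean(0), axis=1).mean()
print(between, within)   # the two confined families form well-separated clusters
```

Mappings confined to non-overlapping regions end up in distant clusters of the embedding, qualitatively reproducing the behaviour of the compact clusters seen in Fig. 8.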
At the same time, we observe that, in this latter scenario, points corresponding to low-E, uniform mappings collapse into a small region of the embedding space. Furthermore, a third group of points corresponding to compact mappings appears, distinct from the ones previously discussed, and absent in the low-σ embedding. The high-σ embedding, thus, highlights two relevant features. The first is the presence of specific regions with qualitatively distinct mapping properties; these are either sparse, but necessarily similar to one another (Fig. 8d), or dense, with atoms localised in different domains of the molecule (Figs. 8c, e). The distance among the latter is necessarily large, since the retained sites cover non-overlapping regions. The second relevant feature is that different groups of points, associated with qualitatively distinct types of mappings, can be connected to one another only by "passing through" a third one, as in the case, see Fig. 8, of mapping c going to e through d. This is suggestive of the presence of routes in mapping space that join points having the same value of the norm, which, however, cannot be connected through "iso-E" paths: to transform mappings such as that in c into that in e through a sequence of single-site changes (i.e. one retained atom is discarded, a formerly discarded one is now retained), one cannot but increase or decrease the value of the norm. While the mapping norm E can be employed to investigate the structure of M itself, the quality of a CG representation can be determined by means of an appropriate cost function. One such function is, e.g. the mapping entropy S_map [23,24,30-32], which is a measure of the intrinsic information loss that is inherent to the process of dimensionality reduction operated by a mapping.
This quantity is defined as

S_map(M) = k_B ∫ dr p_r(r) ln[p_r(r)/p̄_r(r)], (54)

where p_r(r) ∝ exp(−βu(r)) is the Boltzmann weight associated with the atomistic configuration r, while p̄_r(r) represents the "smeared" weight of r upon coarse-graining the system by means of a CG mapping M. More specifically, one introduces the probability of sampling the CG configuration R, given by

p_R(R) = ∫ dr p_r(r) δ(M(r) − R), (55)

where M(r) is the projection operator defined in Eq. 1, as well as the number of high-resolution microstates r mapping onto it,

Ω_1(R) = ∫ dr δ(M(r) − R). (56)

The probability p̄_r(r) is then defined as [31] p̄_r(r) = p_R(M(r))/Ω_1(M(r)). Critically, while both p_r(r) and p̄_r(r) are functions of the atomistic coordinates, they differ in assigning the probability to a given high-resolution configuration or microstate, in that p̄_r(r) associates the same probability with all microstates that map onto the same macrostate R. Minimising the mapping entropy S_map in the space of possible CG representations of the system thus implies maximising the consistency between the reconstructed probability distribution p̄_r(r) and the all-atom one. In Ref. [23] we derived an approximate expression for Eq. 54, which allows one to compute this observable provided that a set of configurations and their energies is available, e.g. sampled from the canonical ensemble by means of an MD simulation:

S_map ≈ (k_B β²/2) ∫ dR p_R(R) ⟨(u − ⟨u⟩_{β|R})²⟩_{β|R}, (57)

where ⟨(u − ⟨u⟩_{β|R})²⟩_{β|R} is the variance of the energies of the atomistic microstates mapping onto macrostate R. While the norm E depends only on the geometric properties of a single protein conformation, S_map is calculated from an ensemble of configurations sampled according to the Boltzmann distribution; S_map(M), thus, contains more information than E(M), since it makes explicit use of the average structural and thermodynamical properties of the system.
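A schematic version of this variance-based estimate can be tested on a discrete toy system, where within-macrostate energy variances are computable exactly; the β²/2 prefactor and the block-grouping rule below are illustrative assumptions, not the estimator of Ref. [23].

```python
import numpy as np

rng = np.random.default_rng(4)
beta = 1.0

# Toy high-resolution system: discrete microstates with random energies u_i.
n = 64
u = rng.normal(0.0, 1.0, size=n)
p = np.exp(-beta * u)
p /= p.sum()                       # canonical weights of the microstates

def s_map(block):
    """Schematic mapping-entropy estimate: (beta^2 / 2) x average within-macrostate
    energy variance. Prefactor and grouping rule are illustrative assumptions."""
    R = np.arange(n) // block      # decimation-like mapping: 'block' microstates -> 1 macrostate
    total = 0.0
    for r in np.unique(R):
        sel = R == r
        pR = p[sel].sum()
        mean = (p[sel] * u[sel]).sum() / pR
        var = (p[sel] * (u[sel] - mean) ** 2).sum() / pR
        total += pR * var
    return 0.5 * beta**2 * total

print(s_map(2), s_map(8), s_map(32))   # coarser mappings discard more information
```

By the law of total variance, merging macrostates can only increase the average within-macrostate energy variance, so the estimate grows monotonically with the coarseness of the mapping, consistent with S_map measuring information loss.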
Here we employ a data set of 1968 CG mappings of AKE with N = 214, generated by us in a previous work [48] and covering a wide range of values of S_map; the relations among these mappings are then quantified in terms of their distance D, taking the enzyme crystal structure as a reference. In this respect, it is worth keeping in mind that D intimately depends on this reference, and mappings that lie close to each other when a given structure is employed might turn out to be closer or further away from each other when a different conformation is used. Figure 9 shows that the two-dimensional embedding obtained through the application of the sketch map algorithm separates the CG mappings according to a gradient of S_map. In particular, the x component of the sketch map and the mapping entropy S_map display a clear anticorrelation. The results suggest that highly informative mappings, characterised by low values of S_map, share geometrical features that are not present in less informative (high-S_map) representations. In other words, the peculiar resolution distribution found in low-S_map mappings separates them from the other elements of M. The relevant features that the mapping entropy highlights thus reverberate in the merely structural characterisation provided by the mapping distance; this connection among the norm E, the distance D, and highly informative representations is potentially interesting and deserves to be further investigated. So far, our analysis of the mapping space has relied on a definition of a scalar product between CG representations based on a single, static structure of the reference protein. Proteins and other biologically relevant macromolecules, however, are not static objects, but rather flexible entities which, in a typically aqueous environment, undergo fluctuations and deformations.
It is therefore natural to extend our metric to incorporate such structural variability; in this section, we will, thus, present some preliminary results obtained by performing such an extension, restricting the discussion, for the sake of brevity, to the case of the mapping norm E. We assume our high-resolution (i.e. atomistic) system, constituted by the protein (whose atomic coordinates are indicated with r) and its environment (indicated with s), to be subject to an interaction potential u(r, s). In the canonical ensemble, the probability density of sampling a given configuration is proportional to the Boltzmann weight, that is,

p(r, s) = e^{−βu(r,s)}/Z,

where Z = ∫ dr ds e^{−βu(r,s)} is the configurational partition function of the system. The norm E of a mapping in Eqs. 30 and 8 only depends on a single conformation of the molecule under examination; however, one can straightforwardly extend the definition of E, and analogously of the scalar product and the distance between mappings, to account for the whole conformational space sampled by the system, in that the canonical average of the norm is taken:

⟨E(M)⟩ = ∫ dr ds p(r, s) E(M; r).

Note that the average is carried out both over the protein and environment degrees of freedom; at the same time, for mappings that only retain protein degrees of freedom, the couplings J_ij, and thus the norm E, only depend on the protein coordinates. The linearity of the norm with respect to the couplings allows one to first compute their thermal average, that is,

⟨J_ij⟩ = ∫ dr ds p(r, s) J_ij(r),

and subsequently employ these averaged couplings in the calculation of norms, scalar products, and distances, in the same manner as was done so far. In this case, however, the resulting values entail information about the conformational space sampled by the whole system, including the environment, described in terms of a high-resolution model. To investigate the effect that accounting for the conformational variability of the system has on the norm of a mapping, Fig.
10 displays a comparison between the value of E computed on the crystal structure of AKE and its canonical average ⟨E⟩ obtained through molecular dynamics sampling. Each point in the plot represents an (E, ⟨E⟩) pair out of 5 × 10⁴ mappings with N = 214, extracted so as to homogeneously span all the possible values of E, see Sect. 3.2. The ensemble average is performed over 10⁴ configurations of a 200 ns long NVT simulation, the technical details of which are available in the SI of Ref. [23]. Interestingly, points are very narrowly dispersed along the diagonal, with a Pearson correlation coefficient very close to unity. This suggests that, at least in this case, the canonical average of E is robust to structural changes: we ascribe this behaviour to the fact that at the outset of the simulation the protein is in its native state and, due to the strong constraints present in the molecule, the local environment of each atom generally performs small-amplitude fluctuations about a well-defined average. In this particular case, the couplings computed by explicitly accounting for the energetics of the system do not induce significant deviations in the value of the norm with respect to their static-structure counterparts; it is hence reasonable to expect that the same will hold for the metric and topological properties of the mapping space discussed so far. However, this consistency will not be observed when secondary and tertiary structures change heavily, as, e.g. in the case of protein folding: the value of E calculated over the unfolded polypeptide chain will not match its canonical average performed over a sample containing folded, more globular configurations. A more detailed understanding of how equilibrium sampling can change the metric properties of the mapping space, especially in the presence of large-amplitude conformational rearrangements, is required, and will be the subject of future work.
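The averaging of the couplings over an ensemble of conformations can be sketched as follows; the frames are generated as small Gaussian fluctuations about a synthetic reference structure, and the Gaussian couplings are again an assumed form. For small-amplitude fluctuations, single-structure and ensemble-averaged norms correlate strongly, mirroring the behaviour of Fig. 10.

```python
import numpy as np

rng = np.random.default_rng(5)

n, sigma, n_frames = 100, 1.0, 50
ref = rng.uniform(0.0, 10.0, size=(n, 3))   # synthetic reference structure

def couplings(xyz):
    """Assumed Gaussian couplings computed on one conformation."""
    d2 = ((xyz[:, None, :] - xyz[None, :, :]) ** 2).sum(-1)
    J = np.exp(-d2 / (2.0 * sigma**2))
    np.fill_diagonal(J, 0.0)
    return J

# Ensemble of conformations: small-amplitude fluctuations about the reference.
frames = [ref + rng.normal(0.0, 0.15, size=ref.shape) for _ in range(n_frames)]
J_ref = couplings(ref)
J_avg = np.mean([couplings(f) for f in frames], axis=0)   # thermally averaged couplings

def norm(J, M):
    return J[np.ix_(M, M)].sum()

maps = [rng.choice(n, size=25, replace=False) for _ in range(500)]
E_static = np.array([norm(J_ref, M) for M in maps])
E_mean = np.array([norm(J_avg, M) for M in maps])
print(np.corrcoef(E_static, E_mean)[0, 1])   # close to 1 for small fluctuations
```

Increasing the fluctuation amplitude (or replacing the frames with structurally distinct conformations) would degrade this correlation, which is the regime, discussed above, where the static-structure metric is no longer representative.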
In this work we have addressed the problem of defining a measure to quantify the distance between two low-resolution representations of a macromolecule, and to "explore" the metric space induced by it. The recent advances in the computational investigation of soft and biological matter have provided us with the tools to perform large-scale simulations of large and complex systems; however, due to the sheer size of the data produced, one has to filter out the large amount of detail with which the system is described [8], thus relying on a coarse-grained description of it. Decimation mappings offer a simple and intuitive way of applying this filter, in that only a subset of a molecule's atoms is retained; however, not all mappings entail or deliver the same amount of information, and the identification of the most informative ones allows one to highlight relevant properties of the system. Various methods have been devised [20-25] to identify the most informative mappings as the solution to an optimisation problem, which thus relies on the definition of an appropriate cost function. Since the landscape induced by the latter is typically a rather rugged one, as is often the case in the field of complex systems [49,50], it is to be expected that more than one "optimal" solution will be found. Hence, to understand the relationship among such solutions, as well as between structural representation and physical properties in general, it is of fundamental importance to possess an instrument to measure the difference, or distance, among mappings. The metric proposed here, which builds on the SOAP measure proposed by Csányi and coworkers [35,36], has been employed to quantify the number, dissimilarity, and structural features of different mappings of a macromolecule in a static conformation, thereby providing the basis for a quantitative analysis of the aforementioned relationship.
The exploration of the mapping space relied on the application of the Wang-Landau enhanced sampling algorithm [28, 29], which allowed us to compute the (logarithm of the) density of states for mappings with a given number of CG sites, as a function of their squared norm. On the one hand, these calculations brought to the surface information about "special" (i.e. atypical) representations that, owing to their small number compared to randomly sampled ones, are exponentially suppressed; on the other hand, we made use of the densities of states to implement a lattice-gas analogy, in terms of which we interpreted mappings of qualitatively different types as different phases of the same physical system undergoing a phase transition. We then made use of the distance between mappings to investigate the properties of optimal reduced representations obtained by minimising the mapping entropy, a measure of the amount of information that a given mapping can return about the underlying system at thermal equilibrium: this last analysis has shown that optimal mappings are markedly distant, and therefore qualitatively different, from randomly sampled ones, thus corroborating the idea that the former belong to a particular subregion of the mapping space endowed with nontrivial properties. Finally, we proposed a possible extension of the theory to samples of conformations at thermal equilibrium, focusing on the case of the mapping norm. In this manner, the J_ij couplings are weighted with the probabilities associated to each configuration, thus indirectly accounting for the system's energy. A number of questions remain open, which could not be addressed in this work.
First, in the application of the theory to the system under examination we observed a substantial consistency between the values of the mapping norm computed with single-structure couplings and their averaged counterparts; however, it is reasonable to expect that this will not hold in general. Consequently, the inclusion of the reference system's conformational variability might lead to interesting outcomes in the analysis of the structural and topological properties of the mapping space. This relevant line of research is currently under investigation. A second open issue, partially related to the former, is that the phase transitions we observed are analogous to the ones reported in a previous work [15], where explicit reference to the molecule's free energy was made: this connection might provide important insights into the relationship between the properties of mappings as elements of the mapping space and the functional characteristics of the underlying system, and it certainly deserves further inspection. Finally, the general character of the tools developed in this work makes them suitable for combination with other methods. For example, they can be employed to quantitatively gauge similarities and differences among the solutions to the mapping optimisation problem obtained with the various protocols proposed in the literature; or to boost the accurate determination of cost-function profiles, whose computation can be accelerated by a preliminary exploration of the mapping space followed by a cycle of biased enhanced sampling simulations [48]. In conclusion, the mathematical, biophysical, and computational methods developed and applied in this work have begun to unearth the treasure of information buried in the relationship between how we look at a system and the properties it is endowed with; we believe that what has been reported here has just scratched the surface.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

For the set of selected degrees of coarse graining N reported in Table 1, the corresponding density of states Ω_N(E) defined in Eq. 33, that is, the number of possible CG representations in the mapping space M that retain N atoms and have a squared norm of E, was determined by relying on the protocol proposed by Wang and Landau (WL) [28, 29, 42, 43]. WL sampling enables one to self-consistently determine Ω_N(E) or, for computational convenience, the associated entropy S_N(E) = ln[Ω_N(E)], through a sequence k = 0, ..., K of nonequilibrium Monte Carlo (MC) simulations that provide an increasingly accurate approximation to the correct result [28, 29]. Given a partition of the ensemble of possible norms E into bins of width δE, the pivotal ingredients of the WL iterative scheme are, respectively: (i) the MC estimate S̃_N(E) of the entropy; (ii) the histogram of visited norms at iteration k, H^k_N(E); and (iii) the modification factor ln(f_k) governing the convergence of the algorithm; for k = 0, one typically sets S̃_N(E) = 0 and ln(f_0) = 1. At the beginning of each iteration k, the histogram H^k_N(E) is set to zero.
Subsequently, a series of MC moves is performed in which a transition between two mappings M and M′, with norms E and E′ respectively, is accepted with probability, see Eq. 43,

acc(M → M′) = min{1, exp[S̃_N(E) − S̃_N(E′)]}. (A1)

In our case, both mappings have N sites but differ by the retention of a single atom. If the move M → M′ is accepted, the histogram H^k_N and entropy S̃_N are updated according to

H^k_N(E′) → H^k_N(E′) + 1, (A2)

S̃_N(E′) → S̃_N(E′) + ln(f_k), (A3)

while in case of rejection one has to replace E′ with E in Eqs. A2 and A3. As highlighted by Eqs. A1 and A3, the early stages of the WL scheme tend to "push" the sampling away from already visited regions of the mapping space, thus significantly boosting its exploration compared to randomly drawing CG representations. The algorithm then evolves to generate a "random walk" in the space of possible norms [43]. The series of MC moves within iteration k is interrupted when the histogram of sampled norms H^k_N(E) is "flat", meaning that none of its entries exceeds a threshold distance from the average ⟨H^k_N⟩ of the histogram. A typical requirement is p_flat × ⟨H^k_N⟩ < H^k_N(E) < (2 − p_flat) × ⟨H^k_N⟩ for every value of E, p_flat being a predefined flatness parameter. When the flatness condition is satisfied, iteration k + 1 of the algorithm begins with a reduced modification factor; in our case, we set ln(f_{k+1}) = ln(f_k)/2. Finally, iterations over k are stopped when ln(f_k) < ln(f_end) ≪ 1, ln(f_end) being another control parameter provided in input to the WL protocol. Up to an additive constant, the MC estimate S̃_N(E) reproduces the exact entropy S_N(E) with an accuracy of order ln(f_end) [51]. In WL sampling, knowledge of the boundaries of the domain of the density of states Ω_N(E) (equivalently, of the entropy S_N) plays a crucial role in the convergence of the iterative scheme, e.g. for checking the flatness of the histogram H_N(E) throughout the simulation [44].
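The update rules of Eqs. A1-A3 can be sketched in a minimal, self-contained toy setting: a binary string of n "retained/non-retained" flags whose "norm" E is simply the number of retained flags, so that the exact density of states is the binomial coefficient C(n, E) and the quality of the WL estimate can be checked directly. This is a didactic analogue, not the mapping-space implementation of the paper; the function name and all parameter values are illustrative.

```python
import math
import random

def wang_landau(n=8, ln_f_end=1e-3, p_flat=0.8, check_every=5000, seed=0):
    """Toy Wang-Landau run: E = number of 'retained' flags in a binary
    string of length n, so the exact density of states is C(n, E).
    Returns the entropy estimate S[E] up to an additive constant."""
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n)]
    E = sum(state)
    S = [0.0] * (n + 1)          # running entropy estimate ln(Omega)
    H = [0] * (n + 1)            # histogram of visited norms
    ln_f = 1.0                   # initial modification factor, ln(f_0) = 1
    moves = 0
    while ln_f >= ln_f_end:
        i = rng.randrange(n)     # propose flipping one flag
        E_new = E + (1 - 2 * state[i])
        # acceptance min{1, exp[S(E) - S(E')]}, cf. Eq. (A1)
        if math.log(rng.random() + 1e-300) < S[E] - S[E_new]:
            state[i] = 1 - state[i]
            E = E_new
        # histogram and entropy of the *current* state, cf. Eqs. (A2)-(A3)
        H[E] += 1
        S[E] += ln_f
        moves += 1
        if moves % check_every == 0:
            avg = sum(H) / len(H)
            if all(p_flat * avg < h < (2 - p_flat) * avg for h in H):
                H = [0] * (n + 1)   # reset histogram, start iteration k+1
                ln_f *= 0.5         # ln(f_{k+1}) = ln(f_k) / 2
    return S
```

After convergence, entropy differences S[E] - S[E'] should approach ln[C(n,E)/C(n,E')] to within an error of order ln(f_end).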
In contrast to "more traditional" systems such as Ising ferromagnets on a lattice [28], this information is not readily available in our case. As such, we initially performed a set of explorative, non-iterative WL runs, i.e. without updating the modification factor ln(f_k), so as to approximately locate the minimum and maximum norms E_min(N) and E_max(N) achievable at each degree of CG'ing. To mitigate the effect of bins that are visited only at a very late stage of the simulation, which risks temporarily "trapping" the mapping-space exploration, we followed the protocol described in Ref. [44]: every time a bin [E_i, E_i + δE] was populated for the first time, it was marked as "visited", the corresponding entropy was initialised to the minimum of S̃_N(E) over the previously visited bins, and the histogram H_N(E) was reset. The results obtained from these preliminary runs for E_min(N) and E_max(N) as a function of N are displayed in Fig. 3 and summarised in Table 2. Having identified the range of possible norms for each investigated degree of CG'ing, we subsequently moved to the determination of the corresponding entropies S_N(E) via the iterative WL scheme. To boost the convergence of the algorithm, for each N we slightly reduced the interval of norms [E_min(N), E_max(N)] with respect to the one predicted by the explorative runs, and divided this spectrum into a total of W_N overlapping windows of equal width, see Table 2 [29]. The overlap between two consecutive windows was fixed to half their size. Within each window, we then performed a separate WL simulation in which confinement of the range of norms was achieved by rejecting all mapping moves M → M′ that would bring the exploration outside the E interval of interest. In discarding these moves, we concurrently updated the histogram and entropy of the current state according to Eqs. A2 and A3, so as to avoid boundary effects [52].
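The half-overlap window construction described above can be sketched as follows; the function name and the (lo, hi) return convention are our own. Since consecutive windows advance by half a width, the full range spans (W_N + 1) half-widths, which fixes the window width.

```python
def wl_windows(e_min, e_max, n_windows):
    """Split [e_min, e_max] into n_windows windows of equal width that
    overlap by half their size, as in the per-window WL runs.
    Returns a list of (lo, hi) tuples covering the whole range."""
    # consecutive windows shift by half a width, so the full range
    # is (n_windows + 1) half-widths long
    width = 2.0 * (e_max - e_min) / (n_windows + 1)
    return [(e_min + 0.5 * i * width, e_min + 0.5 * i * width + width)
            for i in range(n_windows)]
```

For example, three windows over [0, 6] have width 3 and are (0, 3), (1.5, 4.5), (3, 6), each sharing half its width with its neighbour.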
Furthermore, also in these production runs we kept track of the norm bins that were sampled during the course of the simulation, resetting the histogram every time a new bin was populated; the entropy of the latter was initialised to the minimum of S̃_N(E) over the previously visited ones.

Table 2: Lower and upper bounds of the norms, E_min and E_max, identified by the set of preliminary WL runs for each degree of CG'ing N, see Fig. 3, and corresponding values Ē_min and Ē_max employed in the reconstruction of the entropy S_N(E) through the iterative WL scheme. To boost the convergence of the algorithm, the interval [Ē_min, Ē_max] was divided into W_N windows overlapping by half their width; the associated simulations were performed with a flatness parameter p_flat = 0.90, assuming convergence of the iterations when the modification factor ln(f_k) became smaller than ln(f_end) = 10^−6.

All WL simulations were performed setting p_flat = 0.90, and checking the histogram flatness over the visited bins every 3 × 10^6 "single spin" MC moves, each involving the swap of a retained and a non-retained atom in the mapping. We interrupted the iterative scheme when the modification factor ln(f_k) became smaller than ln(f_end) = 10^−6. For each degree of CG'ing N, the outcome of the converged WL protocol is a set of entropies S̃_N,i(E), i = 1, ..., W_N, restricted to bounded and overlapping E domains, which need to be combined to provide the complete S_N(E) over the whole range of investigated norms. These S̃_N,i(E) differ, besides the numerical uncertainties inherent to the self-consistent scheme [53], from the exact results S_N,i(E) by additive constants C_N,i that are not uniform across the different WL windows. Rather than determining the relative shifts that most accurately superimpose the various S̃_N,i(E) profiles within the overlapping regions, see e.g. Ref.
[42], in this work we directly considered the (numerical) derivatives of S̃_N,i(E) in each WL window,

S̃′_N,i(E) = dS̃_N,i(E)/dE = 1/T(E), (A4)

where T is the "temperature" of the system. These derivatives are not affected by the constants C_N,i, so that each S̃′_N,i(E) is approximately equal to its exact counterpart S′_N,i(E). One can thus combine the derivatives of the different WL windows into a global derivative S′_N(E) that extends over the whole range of analysed norms, from which the overall entropy S_N(E) can be calculated as

S_N(E) = S_N(E_min) + ∫ from E_min to E of S′_N(E′) dE′, (A5)

where E_min(N) is the lowest norm sampled at degree of CG'ing N. Note that, in contrast to systems such as the ferromagnetic Ising model, we do not know a priori the value of S_N(E_min), so that the entropy S_N(E) can only be determined up to a constant. To merge the set of derivatives and reconstruct S′_N(E) for each degree of CG'ing, we first applied a Savitzky-Golay filter [54] to the WL estimates of the entropies S̃_N,i(E) so as to reduce the amount of noise in the simulation results, and consequently smooth the derivative S̃′_N,i(E) of each window. A comparison of the derivatives obtained in the presence or absence of the filter, see Fig. 11, highlights how the latter only applies a tiny correction to the original data, which nonetheless significantly improves the quality of the set of S̃′_N,i(E). Despite this refinement, the presence of residual numerical fluctuations leaves room for a certain degree of arbitrariness in how, within the overlap region of two consecutive windows, the combined derivative should be constructed. At the same time, these fluctuations appear to be marginal in the vicinity of the centre of a window, while they tend to increase slightly towards its boundaries (data not shown). Exploiting this observation, we tackled the problem of merging the derivatives of two consecutive windows i and i+1 within their overlap region as follows: first, we divided the region into three separate intervals, the central one being roughly double the size of the other two.
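The smoothing-plus-differentiation step can be illustrated with a minimal local-polynomial derivative estimator in the Savitzky-Golay spirit: fit a low-order polynomial to the points around each grid node and evaluate its analytical derivative there. This is a sketch with illustrative window and order settings, not the exact filter parameters used in the paper.

```python
import numpy as np

def smoothed_derivative(E, S, half_window=3, order=2):
    """Estimate dS/dE by least-squares fitting a local polynomial of
    the given order around each grid point and taking the analytical
    derivative of the fit at that point (Savitzky-Golay idea)."""
    E = np.asarray(E, dtype=float)
    S = np.asarray(S, dtype=float)
    dS = np.empty_like(S)
    n = len(E)
    for j in range(n):
        lo = max(0, j - half_window)           # shrink window at edges
        hi = min(n, j + half_window + 1)
        coeffs = np.polyfit(E[lo:hi], S[lo:hi], order)
        dS[j] = np.polyval(np.polyder(coeffs), E[j])
    return dS
```

On noiseless data a polynomial of degree up to `order` is reproduced exactly, so the derivative of a quadratic entropy profile is recovered to machine precision; on noisy data the local fit averages out the fluctuations.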
Given that the windows overlap by half their width, the first interval is located close to the centre of window i, where the derivative S̃′_N,i(E) is numerically more stable, but close to the boundary of window i+1, where S̃′_N,i+1(E) is slightly noisier; the opposite holds for the last interval. As such, in the first and last intervals we set the combined derivative S′_N(E) equal to S̃′_N,i(E) and S̃′_N,i+1(E), respectively. Within the central interval, increasing E moves us from the vicinity of the centre of window i to that of window i+1. In this region we thus set the final derivative S′_N to a weighted average of the derivatives of the two windows, namely

S′_N(E) = (1 − α(E)) S̃′_N,i(E) + α(E) S̃′_N,i+1(E), (A6)

where α(E) is a mixing parameter that linearly increases from zero to one as E moves from the left to the right boundary of the interval. Repeating this interpolation for the whole set of W_N windows (note that in the first, resp. last, half of the first, resp. last, window no mixing applies) provided us, for each of the analysed degrees of CG'ing, with a global derivative S′_N(E) extending over the whole range of sampled norms. Figure 11 displays a comparison between S′_N(E) and the original, piecewise derivatives for the case N = 856, highlighting the accuracy of our approach. This accuracy is further confirmed by the smooth behaviour of the second derivative S″_N(E) calculated from the reconstructed S′_N, which we display in Fig. 4 for N = 856. Starting from the set of S′_N(E), the corresponding entropies S_N(E) were subsequently obtained via direct integration, see Eq. A5, producing the profiles presented in Figs. 4 and 12. In these figures, entropies were shifted so that their minimum value is zero. Finally, it is interesting to test the dependence of our results on the input parameters of the WL protocol.
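The merging and integration steps, Eqs. (A5) and (A6), can be sketched as follows. This is a simplification that assumes the two windows' derivatives are already available on a common grid; the function names and the exact splitting of the overlap region (outer quarters plus a central half) are our own choices.

```python
import numpy as np

def blend_derivatives(E, dS_i, dS_j, lo, hi):
    """Combine the derivatives of two consecutive WL windows given on a
    common grid E. [lo, hi] is their overlap region, split into three
    intervals with the central one twice as wide as the outer two:
    the first keeps window i, the last keeps window j, and in between
    the two are linearly mixed as in Eq. (A6)."""
    quarter = (hi - lo) / 4.0
    a, b = lo + quarter, hi - quarter          # central mixing interval
    out = np.where(E < a, dS_i, dS_j).astype(float)
    mid = (E >= a) & (E <= b)
    alpha = (E[mid] - a) / (b - a)             # linear weight, 0 -> 1
    out[mid] = (1.0 - alpha) * dS_i[mid] + alpha * dS_j[mid]
    return out

def integrate_entropy(E, dS):
    """Recover S(E) from the merged derivative via trapezoidal
    integration, Eq. (A5), fixing the constant so that S(E_min) = 0."""
    steps = 0.5 * (dS[1:] + dS[:-1]) * np.diff(E)
    return np.concatenate(([0.0], np.cumsum(steps)))
```

When the two windows agree exactly, the blend is the identity; in practice the mixing suppresses the small discontinuity that would otherwise appear at the window boundary.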
While initially all MC simulations were performed with a flatness condition p_flat = 0.90, for the case N = 856 we repeated the calculations using p_flat = 0.95, finding perfect agreement of the reconstructed S′_N(E), see Fig. 11. The same sensitivity analysis was performed for the bin size δE dictating the discretisation of the mapping norms: while in all simulations we employed δE = 0.2, by repeating the calculations for N = 856 with a bin width of δE = 0.5 we again observed excellent agreement of the results, see Fig. 11.

Fig. 12: Behaviour of the entropy S_N(E) of AKE for different degrees of CG'ing. For each N, we report results obtained via (i) Wang-Landau sampling ("WL", red dotted lines), shifting the data so that the minimum of S_N over the range of investigated norms is zero; (ii) a saddle-point approximation of the WL predictions ("SP-approx", orange dashed lines); and (iii) a random drawing of CG representations ("Random", black lines), in this latter case shifting the curve so that its maximum coincides with that of the corresponding WL profile. For N = 214 (resp. N = 856) we further report the squared norm associated to the Cα (resp. backbone) mapping ("Chem", blue dashed line), a CG representation routinely employed when CG'ing a protein system [5, 8].

Fig. 13: Dependence of the heat capacity C_V on the inverse temperature β for the lattice-gas analogue of the mapping norm of AKE, calculated at several degrees of CG'ing. Sharp peaks in C_V at high (resp. low) values of N suggest the presence of a solid-liquid (resp. liquid-gas) transition in the system.
It should be noted that the scales of β and C_V are the same in all plots.

References
1. Atoms to phenotypes: molecular design principles of cellular energy metabolism
2. SARS-CoV-2 simulations go exascale to predict dramatic spike opening and cryptic pockets across the proteome
3. Coarse-grained molecular simulations of large biomolecules
4. Perspective: coarse-grained models for biomolecular systems
5. Coarse-grained protein models and their applications
6. Coarse-grained models of protein folding: toy models or predictive tools?
7. Bottom-up coarse-grained modeling of DNA
8. From system modelling to system analysis: the impact of resolution level and resolution distribution in the computer-aided investigation of biomolecules
9. A coarse-grained protein force field for folding and structure prediction
10. The OPEP protein model: from single molecules, amyloid formation, crowding and hydrodynamics to DNA/RNA systems
11. AWSEM-MD: protein structure prediction using coarse-grained physical potentials and bioinformatically based local structure biasing
12. Protein folding and structure prediction from the ground up: the atomistic associative memory, water mediated, structure and energy model
13. A unified coarse-grained model of biological macromolecules based on mean-field multipole-multipole interactions
14. Scale-consistent approach to the derivation of coarse-grained force fields for simulating structure, dynamics, and thermodynamics of biopolymers
15. Exploring the landscape of model representations
16. Topological and energetic factors: what determines the structural details of the transition state ensemble and "en-route" intermediates for protein folding? An investigation for small globular proteins
17. Anisotropy of fluctuation dynamics of proteins with an elastic network model
18. How fast-folding proteins fold
19. Computational studies of SARS-CoV-2 3CLpro: insights from MD simulations
20. Graph-based approach to systematic molecular coarse-graining
21. Graph neural network based coarse-grained mapping prediction
22. Coarse-graining autoencoders for molecular dynamics
23. An information-theory-based approach for optimal model reduction of biomolecules
24. The impact of resolution upon entropy and information in coarse-grained models
25. Optimal coarse-grained site selection in elastic network models of biomolecules
26. Studies in molecular dynamics. I. General method
27. Molecular dynamics simulations of biomolecules
28. Determining the density of states for classical statistical models: a random walk algorithm to produce a flat histogram
29. Efficient, multiple-range random walk algorithm to calculate the density of states
30. The relative entropy is fundamental to multiscale and inverse thermodynamic problems
31. Coarse-graining entropy, forces, and structures
32. Systematic coarse-graining of potential energy landscapes and dynamics in liquids
33. Notes on Migdal's recursion formulas
34. Renormalization, vortices, and symmetry-breaking perturbations in the two-dimensional planar model
35. On representing chemical environments
36. Comparing molecules and solids across structural and alchemical space
37. Adenylate kinase motions during catalysis: an energetic counterweight balancing substrate binding
38. Small and large scale conformational changes of adenylate kinase: a molecular dynamics study of the subdomain motion and mechanics
39. Coarse-grained description of proteins' internal dynamics: an optimal strategy for decomposing proteins in rigid subunits
40. Domain mobility in proteins from NMR/SRLS
41. Energetics and structural characterization of the large-scale functional motion of adenylate kinase
42. Generalization of the Wang-Landau method for off-lattice simulations
43. Control of accuracy in the Wang-Landau algorithm
44. The HP model of protein folding: a challenging testing ground for Wang-Landau sampling
45. Statistical Mechanics
46. Simplifying the representation of complex free-energy landscapes using sketch-map
47. Demonstrating the transferability and the descriptive power of sketch-map
48. A deep graph network-enhanced sampling approach to efficiently explore the space of reduced representations of proteins
49. Landscape paradigms in physics and biology: introduction and overview (Physica D: Nonlinear Phenomena)
50. Rugged free energy landscapes: common computational approaches to spin glasses, structural glasses and biological macromolecules
51. A new approach to Monte Carlo simulations in statistical physics: Wang-Landau sampling
52. Avoiding boundary effects in Wang-Landau sampling
53. Fast algorithm to calculate density of states
54. Smoothing and differentiation of data by simplified least squares procedures

Acknowledgements The authors thank Giovanni Mattiotti and Virginia Agostiniani for a critical reading of the manuscript and useful comments. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 758588).

Author contributions RP and RM elaborated the study. RM and RP developed the scalar product between mappings. RM developed the software for the mapping norm calculations and the Wang-Landau sampling. MG produced the sketch maps. RM and MG performed the analysis. All authors contributed to the interpretation of the results and the writing of the manuscript.

In this Appendix we provide additional information about the phase transitions observed in the lattice-gas analogue of the mapping norm. Specifically, Fig. 13 displays the heat capacity C_V of the lattice gas, see Eq. 53 in the main text, as a function of the inverse temperature β for different degrees of CG'ing N, calculated from the Legendre-Fenchel transform βF_N(β) of the WL entropies S_N(E), see Eq. 52.
While for the highest degree of CG'ing investigated, N = 1438, the heat capacity depends smoothly on β, for N = 1284 C_V develops a sharp peak at low temperatures, which suggests the presence of a solid-liquid transition in the system. By further decreasing the degree of CG'ing, this solid-liquid peak is initially flanked by a shoulder located at higher temperatures, and finally disappears. The shoulder, on the other hand, grows in magnitude as fewer and fewer sites are retained, and becomes a discontinuity for N = 214, suggesting the appearance of a liquid-gas transition.
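The route from a WL entropy to the heat capacity can be sketched by direct canonical reweighting of the density of states, C_V = β²(⟨E²⟩ − ⟨E⟩²), which for a discrete spectrum is closely related to the Legendre-Fenchel construction used above. The sketch below is a toy check on a binomial entropy (n independent two-level units of unit energy), for which the result is known in closed form; the function name is our own.

```python
import math

def heat_capacity(E, S, beta):
    """Canonical heat capacity C_V = beta^2 * Var(E) computed from a
    density-of-states entropy S(E) by direct reweighting with
    ln w(E) = S(E) - beta*E (shifted for numerical stability)."""
    lw = [s - beta * e for e, s in zip(E, S)]
    m = max(lw)                                  # avoid overflow in exp
    w = [math.exp(x - m) for x in lw]
    Z = sum(w)
    mean = sum(e * wi for e, wi in zip(E, w)) / Z
    mean2 = sum(e * e * wi for e, wi in zip(E, w)) / Z
    return beta * beta * (mean2 - mean * mean)
```

Scanning β over a grid and locating the maxima of C_V(β) is then enough to reproduce peak-and-shoulder plots of the kind shown in Fig. 13.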