key: cord-0957615-2h5yyoia authors: Chen, Ming title: Collective variable-based enhanced sampling and machine learning date: 2021-10-20 journal: Eur Phys J B DOI: 10.1140/epjb/s10051-021-00220-w sha: 0f4f0a9a8a4157d562ba7604e532cca26a5cf415 doc_id: 957615 cord_uid: 2h5yyoia e-mail: chen4116@purdue.edu (corresponding author)

ABSTRACT: Collective variable-based enhanced sampling methods have been widely used to study thermodynamic properties of complex systems. The efficiency and accuracy of these enhanced sampling methods are affected by two factors: constructing appropriate collective variables for enhanced sampling and generating accurate free energy surfaces. Recently, many machine learning techniques have been developed to improve the quality of collective variables and the accuracy of free energy surfaces. Although machine learning has achieved great successes in improving enhanced sampling methods, there are still many challenges and open questions. In this perspective, we shall review recent developments on integrating machine learning techniques and collective variable-based enhanced sampling approaches. We also discuss challenges and future research directions, including generating kinetic information, exploring high-dimensional free energy surfaces, and efficiently sampling all-atom configurations.

Molecular dynamics (MD) simulation is an important tool to study thermodynamic and kinetic properties of complex systems in chemistry, biology, and materials science [1]. Potential energy functions used in MD simulations usually have a tremendous number of local minima separated by high energy barriers, while thermal fluctuation is the only driving force for barrier crossing. Therefore, the time to cross high barriers is close to or longer than typical MD simulation timescales, and sufficient sampling of barrier-crossing events can require millisecond-scale MD simulations [2]. Various enhanced sampling methods have been designed to assist barrier crossing, and these methods have achieved great successes in understanding the properties of various chemical systems [3-17]. In general, there are two categories of enhanced sampling methods that focus on studying thermodynamic properties. The first one is unbiased enhanced sampling, which preserves the Boltzmann distribution. The efficiency of this type of enhanced sampling method is bounded by the central limit theorem, thus reducing sample correlation is the main theme of methodology development. One famous example of unbiased enhanced sampling is the replica exchange method, such as parallel tempering [3] or Hamiltonian replica exchange [4]. In the replica exchange approach, multiple replicas of one MD simulation run simultaneously with different temperatures or different potential energy functions. Sample correlation is reduced by exchanging configurations among replicas to prevent trapping the simulated system in a stable conformation for a long time. The second type of enhanced sampling method biases the configuration distribution away from the Boltzmann distribution. There are two motivations to bias the Boltzmann distribution. First, biasing the Boltzmann distribution improves statistics at important high free energy locations, including metastable conformations and transition states. Second, increasing the probability of visiting barrier tops can often reduce sample correlation and thereby enhance sampling efficiency.
Since preserving the Boltzmann distribution is not required, designs of biased enhanced sampling are more flexible. Even non-equilibrium dynamics [5-9] has been adopted, as long as the Boltzmann distribution can be recovered with post-analysis. Besides these two categories of methods for sampling the Boltzmann distribution, there are many methods focusing on sampling in a path ensemble, which is necessary for studying kinetic properties like rate constants [18-25]. We want to emphasize that these methods are important members of the family of enhanced sampling even if they are not the main topic of this perspective. Most biased enhanced sampling methods focus on several important degrees of freedom. These degrees of freedom are named collective variables (CVs). CVs form a reduced model for a complex process, like a chemical reaction, a biomolecular conformational change, or a material phase transition. With properly selected CVs, important potential energy barriers are mapped onto a free energy surface (FES), so that CV-based enhanced sampling can accelerate crossing them.

We start our discussion by introducing a system of N_a atoms with Cartesian coordinates x = {x_1, ..., x_{N_a}}, where x_i are the Cartesian coordinates of the i'th atom. The system's Hamiltonian is H = T + U, where T is the kinetic energy and U(x) is a potential energy function which describes interactions among atoms. A set of d physically intuited CVs q(x) = {q_1(x), ..., q_d(x)} ∈ R^d is introduced as a coarse-grained representation of the system, where q_i(x) is the i'th collective variable. Fixing q(x) = s, a partition function at temperature T is defined as

Q(s) = C ∫ dx e^{−βU(x)} δ(q(x) − s), (1)

where C is a constant independent of s, β = 1/kT, and δ(·) is the Dirac delta function. In other words, Q(s) is defined as an integration of the Boltzmann factor e^{−βU(x)} over the manifold determined by q(x) = s. A free energy surface A(s) is then defined from Q(s), i.e.

A(s) = −kT ln Q(s) + C̃, (2)

where C̃ is also a constant independent of s. Since we are interested in a free energy difference between two conformations, C̃ is ignored in most cases. Since a FES is a simplified model of a complex system, the FES should be able to represent important macroscopic states of molecules and materials, such as stable conformations of molecules and stable phases of materials. Moreover, it is possible to design CV-based enhanced sampling with appropriate CVs to calculate kinetic properties like rate constants [35]. In many cases, the timescales of these important conformational changes or phase transitions are much longer than typical MD simulation timescales due to high free energy barriers. For example, conformations of alanine dipeptide in vacuum can be represented by two Ramachandran dihedral angles Φ and Ψ. Using these two dihedral angles as CVs defines a FES on which three minima represent three stable conformations: C7eq, C5, and C7ax. The height of the barrier separating C7eq and C7ax is nearly 8 kcal/mol above the free energy at the bottom of the C7eq well (see panel (b) in Fig. 1). Crossing such a high barrier requires extremely long timescale MD simulations. Therefore, efficiently exploring a FES usually requires enhanced sampling techniques. One common enhanced sampling approach is to bias the Boltzmann distribution to enhance statistics in some low probability regions, like metastable conformations and transition states. Biasing the Boltzmann distribution is achieved by modifying the equation of motion. Without loss of generality, we shall use Brownian dynamics as the MD equation of motion.
The Brownian dynamics equation of motion is

μ dx = −∇U(x) dt + √(2μkT) dW, (3)

where μ is a friction coefficient and dW is the increment of a Wiener process. Sometimes, an extended Lagrangian scheme is used in CV-based enhanced sampling methods. In an extended Lagrangian scheme, CVs are coupled to fictitious degrees of freedom s with harmonic potentials [7, 8, 17, 37], i.e.

μ dx = −∇_x [ U(x) + Σ_{i=1}^{d} (κ_i/2)(q_i(x) − s_i)^2 ] dt + √(2μkT) dW, (4a)
μ_i ds_i = κ_i (q_i(x) − s_i) dt + √(2μ_i kT) dW_i, (4b)

where μ_i is an artificial friction coefficient of a fictitious degree of freedom and κ_i is an artificial coupling constant. With Eqs. (4a) and (4b), Eq. (1) becomes

Q(s) = C ∫ dx e^{−β[U(x) + Σ_i (κ_i/2)(q_i(x) − s_i)^2]}. (5)

Eq. (5) agrees with Eq. (1) in the limit of κ_i → ∞ for all κ_i. Biased enhanced sampling methods modify either Eq. (3) [5, 6, 9] or Eqs. (4a)-(4b) [7, 8, 16, 17, 37, 38]. There are two possible ways to bias the Boltzmann distribution: changing the potential energy function [5, 6, 9, 17, 38-41] and increasing the temperature [7, 8]. A biasing potential or biasing force is usually introduced to reshape the potential energy function. The biasing potential can either restrain the simulated molecule around a conformation, or it can fill up minima on a FES to enhance barrier crossing. One famous example of enhanced sampling with a restraint potential is umbrella sampling. The biasing potential used in umbrella sampling has the form of a harmonic restraint,

U_bias(x) = (κ/2) |q(x) − s_0|^2, (6)

centered at a reference point s_0 in the CV-space [10, 41, 42].

Fig. 1 The FES of an alanine dipeptide (a) in vacuum with two Ramachandran dihedral angles Φ and Ψ as CVs is shown in panel (b). The FES in kcal/mol is calculated by a 10 ns well-tempered metadynamics simulation with the OPLS-AA force field [36]. Panel (c) shows trajectories of dihedral angle Φ from a MD simulation (upper) and from the well-tempered metadynamics simulation (lower). Without enhanced sampling, the alanine dipeptide molecule is not able to change its conformation to C7ax within a 10 ns simulation due to high free energy barriers. On the contrary, well-tempered metadynamics significantly enhances barrier crossing between C7eq/C5 and C7ax.

There are many enhanced sampling methods focusing on designing a biasing potential to assist barrier crossing [5, 6, 9, 16, 39, 40, 43]. For example, metadynamics constructs a biasing potential by depositing Gaussian functions along a simulation trajectory {x(t)} [5, 6], i.e.

U_bias(x, t) = Σ_{t_k < t} A(x(t_k)) exp[ −(1/2) (q(x) − q(x(t_k)))^T Σ^{−1} (q(x) − q(x(t_k))) ], (7)

where Σ is a diagonal matrix determining the width of the Gaussian functions and A(x) is a prefactor. A(x) is a constant in metadynamics [5], while A(x) declines with increasing U_bias(x, t) in well-tempered metadynamics (WTM) [6]. There are many different modifications of metadynamics, which can be found in an excellent review [44]. Since nuclear forces control atomic motions in MD simulations, directly applying a biasing force instead of a biasing potential leads to the adaptive biasing force (ABF) method [9]. In ABF, the mean forces F(q) felt by CVs are recorded and biasing forces F_bias = −F(q) are applied to drive the system out of a stable/metastable conformation [9]. There are also methods applying potentials to restrain a system as well as to fill up minima on the FES. A method named "well-sliced metadynamics", which unifies both types of biasing potentials, has also been proposed [38]. In this method, a restraint potential is applied to force the simulated system to explore interesting conformations, while a biasing potential of metadynamics type is used to enhance barrier crossing. In "funnel metadynamics", which is useful for studying the drug binding problem, a funnel-like restraint potential has been designed to confine a drug molecule's diffusion if the drug molecule is too far away from the protein surface, while a biasing potential of metadynamics type enhances sampling of the binding process [45].
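To make the Brownian dynamics of Eq. (3) and the Gaussian deposition of Eq. (7) concrete, below is a minimal, self-contained sketch of well-tempered metadynamics for a single CV, taken to be the position of a Brownian particle in a double-well potential. The potential, time step, Gaussian width and height, and the well-tempered ΔT are illustrative assumptions, not parameters from any cited study, and the sketch is not the implementation of any particular MD package.

```python
import numpy as np

# Minimal sketch: well-tempered metadynamics for a single CV, here simply the
# position of a Brownian particle in a double-well potential U(x) = (x^2 - 1)^2.
kT = 1.0                 # thermal energy
mu = 1.0                 # friction coefficient of Eq. (3)
dt = 1e-3                # integration time step
n_steps = 50_000
deposit_every = 250      # steps between Gaussian depositions
w0, sigma = 0.1, 0.2     # initial Gaussian height and width
dT = 9.0 * kT            # well-tempered k_B*DeltaT; bias factor gamma = 1 + dT/kT

def grad_U(x):
    """Gradient of the double-well potential."""
    return 4.0 * x * (x**2 - 1.0)

centers, heights = [], []    # deposited Gaussian centers and heights

def bias_and_force(x):
    """Biasing potential in the spirit of Eq. (7) and the force -dV_bias/dx."""
    if not centers:
        return 0.0, 0.0
    c, h = np.asarray(centers), np.asarray(heights)
    g = h * np.exp(-0.5 * ((x - c) / sigma) ** 2)
    return g.sum(), (g * (x - c) / sigma**2).sum()

rng = np.random.default_rng(0)
x = -1.0
for step in range(n_steps):
    if step % deposit_every == 0:
        v_bias, _ = bias_and_force(x)
        # Well-tempered rule: the prefactor decays with the bias already deposited.
        heights.append(w0 * np.exp(-v_bias / dT))
        centers.append(x)
    _, f_bias = bias_and_force(x)
    # Overdamped (Brownian) update corresponding to Eq. (3).
    x += (-grad_U(x) + f_bias) * dt / mu + np.sqrt(2.0 * kT * dt / mu) * rng.normal()

# In WTM the FES can be estimated (up to a constant) from the converged bias:
grid = np.linspace(-2.0, 2.0, 200)
v_grid = np.array([bias_and_force(s)[0] for s in grid])
fes = -(kT + dT) / dT * v_grid       # A(s) ~ -[(T + DeltaT)/DeltaT] * V_bias(s)
```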
As mentioned above, another approach to bias the Boltzmann distribution is to increase the simulation temperature. Naively increasing the whole system's temperature is not practical. However, increasing the simulation temperature for the CVs is feasible, and it is the core idea of adiabatic free energy dynamics (AFED) [46], driven adiabatic free energy dynamics (d-AFED) [8], and temperature accelerated molecular dynamics (TAMD) [7]. In order to avoid strong non-equilibrium effects, large masses are assigned to CVs to create an artificial time-scale separation between CVs and the other degrees of freedom. It is also possible to bias the Boltzmann distribution by applying both a biasing potential and a high temperature to CVs, as in unified free energy dynamics [16]. We want to emphasize that we only mention a few CV-based enhanced sampling methods in order to motivate further discussions. For more complete and detailed introductions to CV-based enhanced sampling, please refer to the following outstanding reviews [47, 48].

Although trajectory-based enhanced sampling is not the main topic, we want to briefly introduce its ideas with three illustrative methods. Trajectories, like configurations, can form an ensemble known as the path ensemble, which assigns path probabilities to trajectories [18]. Sampling paths in the path ensemble leads to accurate estimations of kinetic properties, e.g. transition probabilities, correlation functions, and rate constants [18-25]. In transition path sampling [18, 19], an initial path is first proposed to connect two conformations A and B. The path is described by a series of "beads" with phase space coordinates {(x_1, p_1), ..., (x_N, p_N)}, where (x_1, p_1) stays in conformation A and (x_N, p_N) stays in conformation B. A Monte Carlo approach named the "shooting move" [49] has been proposed to update the path, analogous to updating coordinates in a Monte Carlo sampling. In transition interface sampling [20], N interfaces are aligned between conformations A and B. The rate constant from A to B can be evaluated from fluxes of trajectories through the interfaces and the transition probabilities of crossing an interface. Similarly, the milestoning method [25, 50] divides the configuration space into cells. Short MD simulations are performed in each cell to evaluate the mean first passage time for the system to leave one cell. These mean first passage times are then used to evaluate other kinetic properties. While the main object of enhanced sampling focusing on thermodynamic properties is the invariant probability p(x), the main object of trajectory-based enhanced sampling is the transition probability p(x', t'|x, t), since the transition probability determines path probabilities. Trajectory-based enhanced sampling methods like milestoning are able to recover the invariant probability for free, as long as the transition probability is evaluated in these methods. Trajectory-based sampling methods are also strongly related to CVs. (1) Information provided by trajectory-based sampling methods can be used to identify appropriate CVs.
One famous example is the committor analysis from transition path sampling [51]. (2) Predefined CVs are needed for some trajectory-based sampling methods. For example, the cells used in the milestoning method are usually defined with CVs [52, 53]. Although CV-based enhanced sampling methods have been widely used in chemistry, biochemistry and materials science, there are two fundamental issues limiting applications of these methods. First, it is an open question how to construct optimal CVs. Second, accurately generating a multidimensional FES is still challenging. Fortunately, recent developments in machine learning techniques bring a revolution to CV-based enhanced sampling methods. In the next section we shall introduce some basic concepts in machine learning, and we will discuss how to develop and apply machine learning techniques to lift obstacles in enhanced sampling methods in Sects. 4 and 5.

In physical sciences, new theories are usually established based on inference. Starting from fundamental assumptions or experimental results, step-by-step inferences lead to conclusions, rules, and theories. These conclusions, rules, and theories are further applied to new problems to explain observed physical phenomena and to predict unknown experimental results. Besides inference-based approaches, another way to explain existing data and to predict new results focuses on studying data directly with theories and algorithms from statistics, optimization, computer science, and so on. The second approach is known as "machine learning". "Data" is the most important component of machine learning. In general, there are three different types of data: (1) a static set of data with labels, e.g. a set of data {(x, f(x))} where f(x) serves as the label of x; (2) a static set of data without labels, e.g. a set of data {x}; (3) data depending on the on-the-fly execution of a program or an experiment. The three different types of data lead to three different types of machine learning methods: supervised learning, unsupervised learning, and reinforcement learning. In this section, we shall adopt the linear regression method, which is one of the most elementary supervised learning methods, to illustrate how to train a model, how to validate a trained model, and how to predict with a trained model. We want to emphasize that we are only presenting the outline of the linear regression theory. For details and further discussions, please refer to [54]. In a data set X = {(x, y)} with N samples (the "training data set"), y is related to x by y_i = f(x_i) + ε_i, where f(·) is an unknown function and ε_i is a random noise. The linear regression model assumes that the unknown function can be represented as a linear combination of M basis functions φ(x) = (φ_1(x), ..., φ_M(x))^T,

f(x) ≈ w^T φ(x) = Σ_{j=1}^{M} w_j φ_j(x), (8)

where w is a vector of expansion coefficients. Equation (8) is named a "model" with adjustable ("trainable") parameters w. To recover f(x), w has to be optimized such that w^T φ(x_i) matches y_i reasonably well for every data point. This idea leads to the least squares objective function, or least squares loss function,

L(w, X) = Σ_{i=1}^{N} (y_i − w^T φ(x_i))^2. (9)

"Training" the linear regression model means minimizing L. Once L is minimized, the optimal w, named w*, is given by

w* = (Φ^T Φ)^{−1} Φ^T y, (10)

where Φ_ij = φ_j(x_i). With a new input x*, the trained linear regression model predicts an output y* = w*^T φ(x*). An obvious question with this linear regression model is how to choose an optimal M. If M is too small, the model is not flexible enough to approximate f(·) accurately, resulting in systematic errors in predictions.
On the other hand, if M is too large, the model is overfitted and noisy predictions are generated. In the scenario of overfitting, the training error given by L(w*, X) is very small. However, the test error L(w*, X̃) can be very large, where X̃ = {(x̃, ỹ)} is another data set (the "test data set") independent of X. This is because the magnitudes of some w_i become artificially large in order to learn the noise ε. To solve this problem, it is often necessary to restrain the magnitude of w by adding a regularization term to L, i.e.

L(w, X) = Σ_{i=1}^{N} (y_i − w^T φ(x_i))^2 + λ ||w||^2, (11)

where λ is called a "hyperparameter". λ determines the ratio between the systematic error introduced by the regularization term and the random error due to overfitting. λ is not a trainable parameter, and cross-validation [55, 56] is required to find an optimal value of λ. The linear regression method can also be interpreted with probability theory. Assuming that the noise ε is distributed as a normal distribution with zero mean and variance σ, a probability p(y|x, w, σ) can be defined as p(y|x, w, σ) = N(y|w^T φ(x), σ). If all data points in the training data set are independent, a "likelihood" probability is defined as

p(y|x, w, σ) = Π_{i=1}^{N} N(y_i | w^T φ(x_i), σ), (12)

which tells us the probability to observe y = (y_1, ..., y_N)^T, given a set of parameters w and inputs x = (x_1, ..., x_N)^T. A large likelihood probability means that we have a better chance to observe y; therefore, w is optimized if we maximize p(y|x, w, σ). Such an approach is named "maximum likelihood". The logarithm of the likelihood probability leads to the least squares loss function given by Eq. (9). Both the least squares approach and the maximum likelihood approach result in a single optimal w*. On the contrary, the Bayesian approach returns a model that tells us the probability of w. The Bayesian approach to linear regression starts from a famous relation:

p(w|y) ∝ p(y|x, w, σ) p(w), (13)

where p(w), as a "prior" distribution, describes our prior knowledge of w, and p(w|y) is called a "posterior" distribution. Usually, a prior distribution is an elementary probability distribution, e.g. p(w) = N(w|0, λ^{−1} I). Using this prior distribution together with the likelihood in Eq. (12), the logarithm of the posterior distribution is

ln p(w|y) = −(1/2σ) Σ_{i=1}^{N} (y_i − w^T φ(x_i))^2 − (λ/2) ||w||^2 + const, (14)

where we discard the normalization constant. Therefore, maximizing the posterior, which tells us the most probable choice of w, is exactly the same as minimizing the loss function Eq. (11). The similarity between Eqs. (11) and (14) suggests that a prior distribution behaves like a regularization term: our prior knowledge prevents w from taking extreme values if there is not enough training data. However, the power of the Bayesian approach goes far beyond adding a regularization term. Given a new input x*, the Bayesian approach can predict the distribution of y*, which is given by

p(y*|x*, y) = ∫ p(y*|x*, w, σ) p(w|y) dw. (15)

Equation (15) provides the full statistics of a prediction, including its mean and confidence interval. Up to now, we have focused on the elementary linear regression model to introduce various machine learning concepts. Besides linear regression, we also want to briefly introduce the artificial neural network model [57-59] that is commonly used in recent years [60]. An artificial neural network has a layered structure. We define x^(i) as the i'th layer's inputs. The i'th layer's outputs, y^(i), are given by

y^(i) = f(W^(i) x^(i) + b^(i)), (16)

where W^(i) and b^(i) are trainable parameters called weights and biases.
An activation function f(·) is a nonlinear function used to introduce non-linearity into the model; typical choices of f(·) include the sigmoid function, the hyperbolic tangent function, and the rectified linear unit (ReLU) [61]. The outputs y^(i) are then fed into the (i + 1)'th layer as inputs. A simple neural network contains only three layers: an input layer, a hidden layer, and an output layer. Nowadays, a deep neural network can have 10^2-10^3 layers [62]. Besides the most elementary neural network form introduced above, there are many other important architectures of artificial neural networks, including the convolutional neural network [63], the recurrent neural network [59], the residual neural network [62], the graph neural network [64], and so on. Training a neural network is usually achieved by the backward propagation algorithm [59, 65, 66].

Designing CVs is a key step when applying a CV-based enhanced sampling method. Traditionally, physically intuited CVs are used in enhanced sampling simulations. A physically intuited CV either corresponds to an experimentally measurable property or is designed by understanding the underlying physics of the studied process. For example, the end-to-end distance is related to mechanical pulling experiments of biomolecules [67], and the radius of gyration is designed to describe the "compactness" of a molecule [68]. However, applying sampling methods to study complex biomolecules and materials with a limited number (≤ 3) of physically intuited CVs is very challenging. Sampling a complex system efficiently may require 10-100 or more physically intuited CVs [27, 69, 70]. This is because many conformational transitions can happen when simulating a complex system, while one physically intuited CV is appropriate to describe only a few conformational changes. A limited number of physically intuited CVs are not enough to represent all of these conformational transitions, and these CVs are not enough to map all important transition barriers explicitly onto the FES [71]. Since CV-based enhanced sampling methods are only capable of enhancing barrier crossing for barriers on the FES, barriers not on the FES can significantly reduce the efficiency of a CV-based enhanced sampling algorithm [71] or even reduce the accuracy of sampling results [72]. In [71], an alanine dipeptide molecule was simulated with metadynamics using the radius of gyration and the number of hydrogen bonds as CVs. Barriers connecting C7eq and C7ax (see Fig. 1, panel b) were not explicitly mapped onto the FES, resulting in a long simulation time for the alanine dipeptide molecule to leave the C7ax conformation. In [72], linear-response theory was applied to evaluate the systematic error of a FES generated by the TAMD method. The error, which is caused by the non-equilibrium factor in TAMD, is related to correlation functions of the forces felt by CVs, and barriers not explicitly mapped onto the FES slow down the decay of these correlation functions, resulting in a large systematic error of the FES. The motivation to use physically intuited CVs is that the corresponding FES is physically meaningful. Nevertheless, it is possible to simulate a complex system with sophisticated CVs even if the physical meaning of the FES is insignificant [73, 74]. In this case, generating a FES is not the main subject of the simulation. Instead, sample weights are evaluated by unbiasing algorithms [75-83], and unbiased samples are projected onto physically intuited CVs in post-analysis [71].
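As a concrete illustration of physically intuited CVs, the short sketch below evaluates two of the descriptors mentioned above, the end-to-end distance and a (mass-unweighted) radius of gyration, on a synthetic set of Cartesian coordinates; both the data and the simplifications are assumptions made purely for illustration.

```python
import numpy as np

def end_to_end_distance(coords: np.ndarray) -> float:
    """Distance between the first and the last atom of a chain; coords has shape (N, 3)."""
    return float(np.linalg.norm(coords[-1] - coords[0]))

def radius_of_gyration(coords: np.ndarray) -> float:
    """Mass-unweighted radius of gyration, a simple measure of molecular compactness."""
    centered = coords - coords.mean(axis=0)
    return float(np.sqrt((centered ** 2).sum(axis=1).mean()))

# Synthetic "trajectory": 100 frames of a 20-atom random-walk chain.
rng = np.random.default_rng(1)
traj = np.cumsum(rng.normal(scale=0.5, size=(100, 20, 3)), axis=1)

ree = np.array([end_to_end_distance(frame) for frame in traj])
rg = np.array([radius_of_gyration(frame) for frame in traj])
print(f"<end-to-end> = {ree.mean():.2f}, <Rg> = {rg.mean():.2f}")
```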
Without requiring physically intuited CVs, many methods have been developed to construct CVs directly by mining simulation data (configurations) [71, 73, 84-96], which greatly enriches the CV library. Designing CVs, which aims to find a low-dimensional representation of high-dimensional configurations, exactly matches the task of dimensionality reduction algorithms. Dimensionality reduction methods have been widely studied in the machine learning community. The motivation to develop dimensionality reduction methods is the assumption that data points stay around a low-dimensional manifold even if the data points are embedded in a high-dimensional space. Panel (a) of Fig. 2 presents a set of two-dimensional data points. The points concentrate around a curve (a one-dimensional manifold) instead of being randomly scattered across the two-dimensional space. Therefore, it is possible to find a one-dimensional representation of the data points without prior knowledge of the curve, which is the goal of dimensionality reduction methods. Examples of dimensionality reduction algorithms include principal component analysis (PCA) [97], isomap [98], locally linear embedding (LLE) [99], diffusion map [100], t-distributed stochastic neighbour embedding [101] and many others [102]. There are various studies applying different dimensionality reduction algorithms to construct CVs [92-94, 103, 104]. However, most dimensionality reduction algorithms only use information about the data probability distribution, while configurations from a MD simulation also represent simulation kinetics. It is often believed that kinetics provides the key information for CV design. For example, the isosurface of a good CV should be an approximation of the committor isosurface, which is determined by kinetic information [105, 106]. Therefore, it is natural to utilize kinetic information to construct a machine learning-based CV. In practice, other criteria like preserving structural similarity are also used in training machine learning-based CVs. In the next section we shall present several examples of training CVs from MD simulations.

Principal component analysis (PCA) We start our discussion from principal component analysis, which is an elementary dimensionality reduction algorithm. PCA decomposes the sample covariance matrix to find directions with large variances. As shown in panel (a) of Fig. 2, v_1 is the eigenvector of the data covariance matrix with the larger eigenvalue. Therefore, v_1 could represent a "flexible mode", while v_2, the eigenvector with the smaller eigenvalue, may represent a vibrational mode of less interest. Therefore, v_1 is a more suitable CV compared to v_2. PCA is a very simple approach to train CVs, and its drawbacks are obvious. The CV v_1 is a linear combination of the coordinates x_1 and x_2. However, panel (a) of Fig. 2 clearly shows that the actual low-dimensional manifold is non-linear. Since linear PCA is not able to accurately model a non-linear manifold, a non-linear dimensionality reduction method is required for constructing CVs.

Sketch map The second direction to develop machine learning-based CVs is based on preserving structural similarity. In most cases, the Cartesian coordinates x of the studied molecule, like a biomolecule, are clustered around stable conformations.
In order to preserve clustering information, sketch map trains CVs based on non-linear distance matching with the following objective function:

L = Σ_{i≠j} [ F(|x_i − x_j|) − f(|s_i − s_j|) ]^2, (17)

where both F(·) and f(·) are switch functions and s_i is the low-dimensional representation of configuration x_i [73, 74]. L is small only if a cluster of configurations stays within one cluster in the low-dimensional CV-space. Sketch map has been used to identify stable conformations from enhanced sampling simulations [73, 74, 107, 108].

t-Distributed stochastic neighbor embedding (t-SNE) Besides measuring the similarity between two samples with a distance, probability distributions are also used to describe sample similarity. This idea leads to a method named "stochastic neighbor embedding" (SNE) [109]. In SNE, a conditional probability is first defined, i.e.

p^h_{j|i} = K_h(x_i, x_j) / Σ_{k≠i} K_h(x_i, x_k), (18)

where K_h(·, ·) is a symmetric kernel function. A joint probability is then defined as

p^h_{ij} = (p^h_{j|i} + p^h_{i|j}) / (2N), (19)

where N is the number of samples. The design of the joint distribution ensures every sample contributes significantly to the loss function. Similar joint probabilities p^l_{ij} are defined with respect to a low-dimensional representation using a kernel K_l for i ≠ j, with p^l_{i|i} = 0. The Kullback-Leibler (K-L) divergence between p^h_{ij} and p^l_{ij} is minimized to preserve data similarity, i.e. the objective function of SNE is

L = D_KL(p^h || p^l) = Σ_{i≠j} p^h_{ij} ln( p^h_{ij} / p^l_{ij} ),

where D_KL(·||·) denotes the K-L divergence. In SNE, both K_h and K_l are Gaussian kernels. However, Gaussian kernels suffer from the "curse of dimensionality". The distance between two high-dimensional data points is usually much longer than the distance between their low-dimensional projections. Therefore, a fat-tailed kernel has been used to faithfully preserve the distance between two moderately separated high-dimensional samples. The idea of using mismatched kernel tails to compensate for mismatched dimensionality leads to t-SNE, in which K_l is a Student's t-distribution [101]. t-SNE has been applied to train a low-dimensional representation of configurations to analyze MD trajectories [104].

Fig. 2 Panel (a) illustrates the dimensionality reduction problem. Red two-dimensional points are scattered around a blue curve which is unknown to a dimensionality reduction algorithm, and the goal of the dimensionality reduction algorithm is to recover the blue curve. Green arrows correspond to eigenvectors of the sample covariance matrix. The longer arrow, v1, is the eigenvector with the larger eigenvalue, while the shorter arrow, v2, is the eigenvector with the smaller eigenvalue. Panel (b) shows the FES corresponding to two StKE CVs [71]. Configurations are sampled by a ∼7 ns active enhanced sampling simulation with the OPLS-AA force field [36].

Autoencoder An autoencoder model contains an encoder and a decoder. The encoder is a function f_e that maps high-dimensional inputs x to low-dimensional latent variables q, while the decoder f_d decodes q back to high-dimensional outputs x [110]. Nowadays, both f_e and f_d are usually represented by neural networks. The encoder is a natural choice of CVs: x represents Cartesian coordinates or high-dimensional features of a configuration, while q are low-dimensional CVs. There is no unique way to design an encoder, and we will introduce several examples. In the first example, an autoencoder has been trained by optimizing the following loss function:

L = Σ_i || x_i − f_d(f_e(x_i)) ||^2 + regularization, (20)

where the regularization term is usually a summation of the L2 norm of the parameters in f_d and the L2 norm of the parameters in f_e. Equation (20) minimizes the difference between an encoder's inputs and the corresponding outputs of the decoder [91]. In practice, x could be the atomic Cartesian coordinates or internal coordinates of the studied molecule. If x are internal coordinates, which are invariant under rigid rotation, Eq. (20) can be applied directly. However, further modifications of Eq. (20) are needed to train rotation-invariant CVs if Cartesian coordinates are used as the encoder's inputs [91]. The trained encoder can be used to indicate regions in x space that have already been sampled by MD simulations. Umbrella sampling simulations are then employed with restraint potentials located at the boundaries of the sampled regions. Configurations from these simulations further expand the sampled regions. The iterations continue until the sampling process converges [91].
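A minimal sketch of this first autoencoder example is given below: an encoder/decoder pair trained with a reconstruction loss of the form of Eq. (20), with a two-dimensional latent layer playing the role of the CVs. The featurized inputs, layer sizes, and optimizer settings are illustrative assumptions written here with PyTorch, not the architecture or data of the cited works; in practice the inputs would be internal coordinates or otherwise rotation-invariant features, as discussed above.

```python
import torch
from torch import nn

# Toy data standing in for featurized configurations (e.g., internal coordinates).
x = torch.randn(2048, 30)

encoder = nn.Sequential(nn.Linear(30, 64), nn.Tanh(),
                        nn.Linear(64, 2))            # 2 latent variables -> CVs
decoder = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                        nn.Linear(64, 30))

params = list(encoder.parameters()) + list(decoder.parameters())
# weight_decay adds the L2 regularization term appearing in Eq. (20).
opt = torch.optim.Adam(params, lr=1e-3, weight_decay=1e-4)

for epoch in range(200):
    opt.zero_grad()
    x_hat = decoder(encoder(x))
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()      # reconstruction error
    loss.backward()
    opt.step()

cv_values = encoder(x).detach()                       # CVs for each configuration
```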
In the second example, the objective function to train an autoencoder is a linear combination of the coordinate-matching objective function (Eq. 20) and the sketch-map objective function (Eq. 17) [111]. This work further demonstrates that the trained decoder is capable of generating all-atom structures from latent variables. In the third example, an autoencoder model has been developed by integrating kinetic information, e.g. the input of the encoder is x at time t and the output of the decoder is x at time t + τ [112]. A recent development of autoencoder theory is based on the Bayesian theory. This autoencoder method, named the "variational autoencoder", is a generative model that generates x following a probability distribution p(x) [110]. In the variational autoencoder model, an "Evidence Lower Bound" (ELBO) [54] loss function is optimized, i.e.

L(θ, φ) = Σ_i { D_KL( p̃_φ(q|x_i) || p(q) ) − E_{q∼p̃_φ(q|x_i)}[ ln p_θ(x_i|q) ] }, (21)

where p̃_φ(q|x) approximates p(q|x) and p(q) is a prior distribution of q, which is usually a standard normal distribution. The first term of Eq. (21) is a regularization term to make sure p̃_φ(q|x) stays close to p(q). Minimizing the second term of Eq. (21) attempts to match the decoder outputs with the encoder inputs, i.e. the decoder has the largest probability to output x if this x is the encoder's input. In practice, p̃_φ(q|x) is a normal distribution whose mean and standard deviation are outputs of the encoder neural network. p_θ(x|q) is another normal distribution with its mean generated by the decoder neural network and its standard deviation as a hyperparameter. The decoder can be easily adopted as a configuration generator [95], while applying the encoder as CVs is not straightforward. The variational autoencoder suggests that q is a random variable instead of a deterministic function of x. In order to use q in an enhanced sampling method like metadynamics, an extra fitting step has been designed to train a deterministic function q_d(x) such that the K-L divergence between the probability of q_d and the marginal probability distribution of q is minimized [86].

Classification neural network Recently, artificial neural networks have achieved great successes in classification problems. In a classification problem, e.g. recognizing objects in an image, the image is fed into a neural network and the neural network returns the probabilities of assigning different classes to a pattern on the image. Similarly, local order parameters are also "classifiers" that indicate whether a local structure of a solid state material belongs to an fcc structure, a bcc structure, a disordered structure, or other structures. A classification deep neural network which outputs the probabilities of assigning crystal structure classes performs as a local order parameter q [96]. A global order parameter is then defined as Q = (1/N) Σ_i q_i, where i runs over the N atoms.
Q has been applied successfully to study solid state phase transitions [96].

Time-lagged independent component analysis (TICA) As discussed above, preserving kinetic information is an important guideline for CV design. One successful approach to train CVs with kinetic information is time-lagged independent component analysis (TICA). In TICA, a time correlation matrix C is built with a set of predefined high-dimensional trial CVs, i.e. C_ij(τ) = ⟨q^h_i(x(t)) q^h_j(x(t + τ))⟩, where τ is a lag time [113, 114]. The motivation of TICA is obvious: optimal CVs should represent slow modes, which are characterized by slowly decaying correlation functions. By solving the generalized eigenvalue problem C(τ)q = λC(0)q, an optimal CV is the eigenvector with the largest eigenvalue. The lag time τ should be selected to distinguish fast and slow modes. TICA is an important tool to construct a Markov state model [88, 115-117], and TICA CVs trained from MD simulations are able to accelerate metadynamics simulations [89]. With proper approaches to unbias correlation functions from biased enhanced sampling, it is also possible to use biased enhanced sampling trajectories to train TICA CVs [118].

Past-future information bottleneck (PIB) Studying correlation functions is not the only way to utilize kinetic information; such information can also be recovered by directly investigating the relationship between x(t) and x(t + Δt). Following this idea, CVs are constructed by the principle of the past-future information bottleneck [87, 119]. In the past-future information bottleneck, CVs q(x) are trained by maximizing the objective function

L = I(q(x(t)), x(t + Δt)) − γ I(x(t), q(x(t))), (22)

where I(u, v) is the mutual information between two random variables u and v and γ controls the trade-off between the two terms. Maximizing I(q(x(t)), x(t + Δt)) means q(x(t)) contains as much information as possible to predict the future states x(t + Δt), while minimizing I(x(t), q(x(t))) keeps the form of q as simple as possible [87].

Diffusion map It is also possible to approximate kinetic information, like transition probabilities, with the Boltzmann distribution. L, the infinitesimal generator of Eq. (3), describes the dynamical behavior of a system, i.e. the time-dependent probability density ρ(x, t) becomes

ρ(x, t) = ρ(x) [ 1 + Σ_{i≥1} c_i e^{−λ_i t} ψ_i(x) ], (23)

where λ_i > 0 and ψ_i are the i'th eigenvalue and eigenfunction of L, and ρ(x) is the Boltzmann distribution. The coefficients c_i are determined by the initial probability distribution. Equation (23) suggests that the ψ_i(x) with small λ_i are "slow" degrees of freedom. Thus eigenfunctions ψ_i(x) with small λ_i become natural choices of CVs [94, 120]. However, the computational cost of solving the eigenfunction problem of L is prohibitively high for realistic systems. Fortunately, diffusion map provides an alternative approach to approximately solve the eigenproblem of L [100, 121]. Assuming a set of samples {x_i} generated from Eq. (3), a kernel matrix is defined as K_ij = k_σ(x_i, x_j), where k_σ(x, y) is a Gaussian kernel function with broadening σ. The kernel matrix is then scaled by the Boltzmann distribution, i.e. D_ij = K_ij / (√ρ(x_i) √ρ(x_j)). Finally, D_ij is normalized to form a transition matrix: M_ij = D_ij / (Σ_k D_ik). In the limit of infinite samples and σ → 0, the right eigenvectors of M weakly converge to the eigenfunctions of L [100, 121]. Diffusion map has been used to analyze MD simulation data [94], and diffusion-map-based enhanced sampling methods have also been developed [120].
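The diffusion-map construction described above can be sketched in a few lines. In the example below the "samples" are synthetic two-dimensional points forming two clusters (standing in for two metastable states), the density entering the scaling step is a simple kernel density estimate, and the kernel width σ is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic samples: two clusters standing in for two metastable states.
x = np.vstack([rng.normal(-1.5, 0.3, size=(300, 2)),
               rng.normal(+1.5, 0.3, size=(300, 2))])

sigma = 0.5
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-d2 / (2.0 * sigma**2))                    # Gaussian kernel matrix

# Density scaling: samples are assumed to follow the Boltzmann distribution,
# so a kernel density estimate plays the role of rho(x) in the text.
rho = K.sum(axis=1)
D = K / np.sqrt(np.outer(rho, rho))

# Row normalization gives a Markov transition matrix M.
M = D / D.sum(axis=1, keepdims=True)

# Right eigenvectors of M approximate eigenfunctions of the generator;
# the first nontrivial one is a candidate slow CV.
vals, vecs = np.linalg.eig(M)
order = np.argsort(-vals.real)
slow_cv = vecs[:, order[1]].real                      # order[0] is the trivial mode
print("top eigenvalues:", vals.real[order[:3]])
```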
Spectral gap optimization of order parameters (SGOOP) The second example of training kinetics-based CVs with thermodynamic properties is spectral gap optimization of order parameters (SGOOP) [84]. In SGOOP, grids are first built in the low-dimensional CV-space. The time-dependent probability of the CVs on the n'th grid point, p_n(t), is propagated with

dp_n(t)/dt = Σ_m K_nm p_m(t), (24)

where m and n loop over all grids in the CV-space. A transition probability Ω is estimated by Ω = exp{Kδt} ≈ I + Kδt, where δt is a small time interval. Ω is generated by the maximum caliber approach, i.e. maximizing the entropy of microscopic paths with physical constraints [122]. The entropy is defined as

S = −Σ_{m,n} p_m Ω_mn ln Ω_mn. (25)

A path-dependent physical observable A discretized on the grids is denoted as A_mn. The optimal Ω_mn can be obtained by maximizing S defined in Eq. (25) with the constraint that the path-ensemble average of A_mn equals a certain value. It has been shown that solving the maximization problem leads to Ω_mn = √(p_n/p_m) e^{−λA_mn}, where λ is the Lagrange multiplier [123]. In the simplest case, the average number of transitions, with A_mn = 1, is the only constrained physical quantity, leading to Ω_mn = √(p_n/p_m) e^{−λ}, which has been used in SGOOP. Ω as well as its eigenvalues ε varies with different CVs. Maximizing the spectral gap |ε_α − ε_{α+1}| results in optimal CVs, where α represents the number of barriers on the FES [84].

Stochastic kinetic embedding (StKE) Although diffusion map provides an approach to project existing samples, using ψ_i(x) as CVs in enhanced sampling simulations requires an explicit functional form of ψ_i(x) so that ∇ψ_i(x) can be evaluated analytically. We will introduce an alternative method, named stochastic kinetic embedding (StKE) [71], which is based on diffusion map to construct differentiable CVs. Assuming that a system can be described by a large number of physically intuited CVs q_h (high-dimensional CVs) with a FES A_h(s_h) at q_h(x) = s_h, StKE approximates the dynamics of the CVs as a Brownian dynamics, i.e.

μ̄ ds_h = −∇A_h(s_h) dt + √(2μ̄kT) dW, (26)

where μ̄ is an effective friction coefficient. In principle, a generalized Langevin equation should be used to accurately model CV dynamics [124, 125]. However, several studies have suggested that Brownian dynamics is a reasonable approximation of CV dynamics in metadynamics simulations [126, 127]. Following the diffusion-map construction, a transition matrix M^h_ij is built from samples of the high-dimensional CVs s_h, and a corresponding matrix M^l_ij is built from a low-dimensional embedding s_l(s_h; θ). If s_l is an optimized low-dimensional representation of s_h, M^l_ij should be close to M^h_ij. Therefore, the K-L divergence has been used to train s_l(s_h; θ), i.e. the loss function becomes

L(θ) = Σ_i D_KL( M^h_{i·} || M^l_{i·} ). (27)

In practice, s_l(s_h; θ) is modeled by a deep neural network and θ are trained by optimizing Eq. (27). Panel (b) of Fig. 2 shows a FES obtained with two StKE CVs.

One of the most important goals of an enhanced sampling simulation is to generate a FES associated with the selected CVs. There are different approaches to calculate a FES from simulation trajectories, according to the different enhanced sampling methods. For example, weights can be assigned to configurations from umbrella sampling via the weighted histogram analysis method (WHAM) [79], the multistate Bennett acceptance ratio (MBAR) [80], and the family of transition-based reweighting analysis methods (TRAM) [81-83]. Metadynamics is able to generate a FES either by inverting a biasing potential or by unbiasing samples [5, 6, 75-77, 83]. In d-AFED/TAMD, a FES is calculated from the probability of s or by fitting mean forces [7, 8, 16, 128]. In ABF, a FES is fitted by matching mean forces if d > 1 [9].
In general, a FES is usually evaluated from the CV probability and/or mean forces. Histogram or kernel density estimation is a common approach to evaluate a probability distribution. With kernel density estimation, the probability at point s is given by ρ(s) = (1/N) Σ_i ω_i K(s, s_i), where ω_i is the weight of the sample s_i and K(s, s_i) is a symmetric, positive-definite kernel function. A kernel can either have a fixed bandwidth (broadening) or a flexible bandwidth [129-131]. Besides kernel density estimation, there are other machine learning approaches to learn a probability distribution. For example, a mixture model like a mixture of Gaussians can also model a probability distribution. The Gaussian mixture model is usually used as a clustering algorithm with a model probability distribution (likelihood)

ρ_GM(s) = Σ_i p_i N(s | μ_i, Σ_i), (28)

where p_i is the weight of the i'th cluster [132]. N(s|μ_i, Σ_i) is a normal distribution function with mean μ_i and covariance matrix Σ_i associated with the i'th cluster. Training a Gaussian mixture model with maximum likelihood results in optimized μ_i, Σ_i, and p_i, as well as the probabilities of assigning each sample to each cluster. The log likelihood function of a Gaussian mixture model is Σ_i log ρ_GM(s_i), and maximizing the log likelihood function is equivalent to minimizing the K-L divergence D_KL(ρ_samp || ρ_GM), where ρ_samp is the sample probability distribution. Therefore, optimizing ρ_GM not only classifies samples but also establishes an optimized probabilistic model of ρ_samp. In practice, the number of Gaussians used in the Gaussian mixture model can be determined by the Bayesian information criterion, which is a criterion for model selection [133]. Gaussian mixture models have been used in enhanced sampling algorithms to approximate CV distributions [70, 133]. Besides Gaussian mixture models, some generative models are also capable of predicting the probability of a sample besides generating new samples. For example, propagating along one direction of a normalizing flow model [134] generates a random sample, while propagating along the other direction of the normalizing flow model returns the probability of an input sample. Detailed discussions about normalizing flow models can be found in the next section. Another approach to generate a FES is to "integrate" mean forces. If d = 1 and Eq. (3) is used, the mean force F(s) at q(x) = s can be evaluated by

F(s) = ⟨ −∂U/∂q_1 + kT ∂ ln|det J| / ∂q_1 ⟩_{q(x)=s}, (29)

where J is the Jacobian matrix of the transformation to generalized coordinates with q_1 = q(x) [11]. Theoretically, applying Eq. (29) requires full knowledge of the Jacobian matrix, which is impracticable for CVs with complex functional forms. To simplify Eq. (29), alternative formulas have been proposed in ABF and blue moon sampling [135-137]. The formula to calculate the mean force is greatly simplified within the extended Lagrangian framework, i.e.

F_el(s) = ⟨ κ (q(x) − s) ⟩_s. (30)

We want to emphasize that extending Eq. (30) to d > 1 is trivial, which makes it suitable for multi-CV simulations. However, a systematic error exists in F_el with finite κ. Decreasing the systematic error requires increasing κ, while increasing κ can significantly enlarge the variance of κ(q(x) − s), leading to a large statistical error. In practice, κ should be tuned to balance the systematic error and the statistical error. Recently, a new mean force estimator, named the "corrected z-averaged restraint (CZAR) estimator", has been developed, which can significantly reduce the systematic error [17].
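As a small illustration of building a FES from the CV probability, the sketch below uses weighted kernel density estimation, as described above (here normalized by the sum of the weights), and converts the density into a free energy via A(s) = −kT ln ρ(s); the samples, weights, bandwidth, and temperature are synthetic placeholders for reweighted enhanced sampling output.

```python
import numpy as np

kT = 0.6  # kcal/mol, roughly room temperature

# Synthetic 1D CV samples and unbiasing weights (placeholders for the output
# of an enhanced sampling run after reweighting).
rng = np.random.default_rng(3)
samples = np.concatenate([rng.normal(-1.0, 0.2, 4000), rng.normal(1.0, 0.3, 1000)])
weights = np.ones_like(samples)            # here: already unbiased -> uniform weights

def fes_kde(grid, s, w, bandwidth=0.1):
    """A(s) = -kT ln rho(s) with a weighted Gaussian kernel density estimate."""
    diff = (grid[:, None] - s[None, :]) / bandwidth
    kern = np.exp(-0.5 * diff**2) / (bandwidth * np.sqrt(2 * np.pi))
    rho = (kern * w[None, :]).sum(axis=1) / w.sum()
    A = -kT * np.log(rho + 1e-12)
    return A - A.min()                     # shift so the global minimum is zero

grid = np.linspace(-2.0, 2.0, 200)
A = fes_kde(grid, samples, weights)
```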
With the CV probability and/or mean forces, it is possible to construct a FES with various machine learning models. The simplest model is linear regression, which expands the FES as a linear combination of basis functions, i.e.

A(s) ≈ Σ_i C_i Φ_i(s), (31)

where {C_i} are expansion coefficients and {Φ_i(s)} are basis functions. This approach has been applied to metadynamics in the extended Lagrangian framework [128], unified free energy dynamics [16], ABF [9, 17], and other methods [41, 138]. Linear regression has been widely used due to its quadratic objective function with a unique global minimum. However, this method scales poorly with the number of CVs (the dimensionality d). For example, constructing a four-dimensional FES is already non-trivial [16]. Therefore, applying the linear regression method to learn a high-dimensional FES is not feasible. Other, more advanced machine learning methods have been used to train a FES. We will describe several models in this section to introduce the basic ideas of training a FES.

Artificial neural network Artificial neural networks have been used in various studies to train FESes due to their capability to accurately approximate arbitrary smooth functions with reasonable computational costs [139]. In one study, a neural network has been trained by matching the neural network's gradient with mean forces (a force estimator) [140, 141]. Fitting a neural network with free energies evaluated at different points in the CV-space (an energy estimator) has also been proposed [142]. Although a neural-network-based FES can be trained with either a force estimator or an energy estimator, combining the two estimators can significantly improve the accuracy of the trained FES [143]. Besides training a neural-network-based FES as a post-analysis of enhanced sampling simulations, neural networks also serve as biasing potentials in various enhanced sampling methods. For example, an artificial neural network has been used to construct a biasing potential in the variational enhanced sampling method, where the converged biasing potential can further be used to evaluate the FES [144]. Within the ABF framework, a neural network has been trained to provide smooth estimations of mean forces [145]. In a reinforcement-learning-based enhanced sampling approach, a neural-network-represented FES can indicate regions in the CV space with insufficient samples, guiding MD simulations to explore these regions [141].

Kernel methods Kernel methods such as kernel ridge regression (KRR) and support vector regression have also been tested as possible models to train a FES [142]. Kernel regression is closely related to linear regression by recasting the regression problem in its dual formulation [54]. Therefore, kernel regression deals with kernel functions instead of finite basis functions, which allows one to implicitly use a large, or even infinite, number of basis functions. We will use KRR as an illustrative example. In KRR, a FES is approximated with

A(s) ≈ Σ_{i=1}^{N} α_i K(s, s_i), (32)

where i loops over N samples and K(·, ·) is a kernel function. The trainable parameters α are determined by α = (K + λI)^{−1} A, where K_ij = K(s_i, s_j) and A is the vector of free energies at the training points. λ, as a hyperparameter, controls the regularization strength. Intuitively, KRR attempts to "interpolate" the free energy at a new location s by measuring the similarity between s and each sample s_i with the kernel function and by assigning an appropriate weight α_i to K(s, s_i).
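A minimal sketch of kernel ridge regression applied to FES interpolation in the spirit of Eq. (32) is shown below; the training points are synthetic noisy free energy estimates on a one-dimensional CV, and the Gaussian kernel, its length scale, and λ are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, ell=0.3):
    """K(s, s') = exp(-|s - s'|^2 / (2 ell^2)) for 1D CV values."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

# Synthetic training data: noisy free energy estimates on a coarse grid.
rng = np.random.default_rng(4)
s_train = np.linspace(-2.0, 2.0, 25)
A_train = (s_train**2 - 1.0) ** 2 + rng.normal(scale=0.05, size=s_train.size)

lam = 1e-3                                   # regularization hyperparameter
K = gaussian_kernel(s_train, s_train)
alpha = np.linalg.solve(K + lam * np.eye(s_train.size), A_train)  # weights of Eq. (32)

# Predict the FES on a fine grid: A(s) = sum_i alpha_i K(s, s_i).
s_new = np.linspace(-2.0, 2.0, 400)
A_pred = gaussian_kernel(s_new, s_train) @ alpha
```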
It is also possible to establish a kernel method via the Bayesian theory. One famous example is the Gaussian process regression (GPR) method [54]. In GPR, a Gaussian process is used as a function prior. A Gaussian process is a stochastic process y(t) such that the joint distribution of {y(t_1), ..., y(t_n)} is a multivariate normal distribution for any t_1, ..., t_n, i.e.

(y(t_1), ..., y(t_n))^T ∼ N(0, K + λ^{−1} I), (33)

where K_ij = K(t_i, t_j) with K(·, ·) a kernel function and I is an identity matrix. λ represents the precision of the data noise. Without loss of generality, we assume that there is a data set of free energies A = (A_1, ..., A_n)^T at s = (s_1, ..., s_n)^T. If we want to estimate the free energy A* at a new location s*, the joint probability p(A*, A) follows Eq. (33) and the conditional probability p(A*|A) becomes another normal distribution with mean μ* and covariance σ* given by

μ* = k^T (K + λ^{−1} I)^{−1} A, (34a)
σ* = K(s*, s*) − k^T (K + λ^{−1} I)^{−1} k, (34b)

where k_i = K(s*, s_i). The averaged prediction (Eq. 34a) is the same as a KRR prediction. Besides generating an averaged prediction, GPR also provides the confidence of a prediction. The GPR method has been used to train a FES with free energies as well as mean forces [138].

6 Challenges and perspective

Usually, exact kinetic information like transition probabilities or correlation functions is calculated from unbiased MD simulations with methods like Markov state models [146, 147] or trajectory-based enhanced sampling methods like milestoning [25, 50]. Evaluating exact kinetics from biased enhanced sampling simulations is attractive, as it opens a door to introducing highly efficient biased enhanced sampling approaches to build kinetic models. Moreover, exact kinetics is needed for some CV construction methods like TICA [113, 114]. Since most biased enhanced sampling methods are only designed to calculate thermodynamic properties, efforts are needed to develop efficient and accurate algorithms to unbias dynamical information. In WTM, unbiased correlation functions can be evaluated by a change of variable on time [76, 118]. Time is "compressed" in metadynamics since a biasing potential accelerates the simulation. A properly designed change of variable on time can recover the unbiased timescale. This approach has been applied to constructing TICA CVs from metadynamics simulations [118]. However, the formula of the change of variable is asymptotic and is only valid in the long time limit [76, 118]. Another way to obtain correct kinetic information with metadynamics is to unbias the dynamics of metadynamics with transition state theory [35]. In this approach, Gaussians are deposited less frequently to avoid biasing the transition state ensemble during a simulation, and unbiasing is achieved by reweighting the partition function of reactant conformations. The accuracy of this method depends on the frequency of Gaussian deposition, which requires careful tuning and testing [148]. A more general approach to obtain accurate kinetic information is to unbias metadynamics path probabilities with the Girsanov theorem [149, 150]. The Girsanov theorem evaluates the change of path probability associated with one diffusion process, while the path itself is generated by another diffusion process. Although this approach does not require any modifications of the metadynamics algorithm, it limits the dynamical equation to Brownian dynamics [149] or the Langevin equation [150], which rules out deterministic thermostats. Also, the unbiasing formula depends on the integrator used in a simulation [150], thus the formula has to be updated if a different integrator is used.
Recently, the evaluation of path probabilities with a path integral formula has been integrated into metadynamics with a polymer model [151]. This model requires simulating multiple replicas of a system, and it needs further benchmarks on more challenging systems like biomolecules or materials. In summary, the methods introduced in this paragraph suggest that it is highly non-trivial to evaluate kinetics from biased enhanced sampling simulations, and that further developments on both theory and numerical algorithms are required. Besides unbiasing kinetics with physical principles, it is also interesting to explore machine learning related theories and methods to learn kinetic properties from biased enhanced sampling. There is a likelihood function to estimate the transition matrix P from a counting matrix C in a Markov state model built from unbiased MD simulations, i.e.

p(C|P) ∝ Π_{ij} P_ij^{C_ij}, (35)

where P_ij is the transition probability from the i'th state to the j'th state and C_ij is the number of transitions from the i'th state to the j'th state [152]. This formula can be extended to equilibrium multi-ensemble simulations with different temperatures or different biasing potentials [153]. Although this approach has been applied to umbrella sampling, it is not clear whether it can be extended to quasi-equilibrium enhanced sampling like metadynamics, d-AFED/TAMD or ABF. Therefore, developing machine learning techniques to learn a kinetic model from quasi-equilibrium enhanced sampling simulations is another open problem. Up to now we have discussed developments and challenges in obtaining unbiased kinetic information from biased enhanced sampling simulations. Due to the difficulties of unbiasing kinetics, methods have been proposed to train CVs with approximate kinetic information. Although various approximations of kinetics have been used in different CV-training methods [71, 84, 87, 94], the trained CVs work quite well in all of these methods. Actually, it is not clear whether a rigorous kinetic model is necessary for training CVs, and the relationship between the quality of CVs and the approximations in kinetic models is not clear. It is possible that the capability of trained CVs to represent minimum free energy paths is an important factor in the effectiveness of trained CVs [84]. However, systematic studies are needed to answer this question. Section 4 introduces various methods to construct CVs as a low-dimensional representation of a system. One key question is how to estimate the dimensionality of the low-dimensional representation, i.e. what is an optimal number of CVs. It has been believed that configurations in R^{3N} sampled from MD simulations stay close to a d-dimensional manifold, where d, named the intrinsic dimensionality (ID), is much smaller than 3N [154-156]. Recently an elegant algorithm has been developed to estimate the intrinsic dimensionality d [156]. The idea is to test, for each sample, the ratio α = r_2/r_1 between the distances to its second and first nearest neighbors; under mild assumptions the cumulative distribution of α is F_c(α) = 1 − α^{−d}. In practice the empirical cumulative distribution is used to replace F_c(α), and d is recovered with linear regression. This method has been applied to several biomolecular systems. For example, the ID of the RNA trinucleotide AAA with 98 atoms is ∼10 [156]. The ID of a mini protein, the FiP35 WW domain, is around 14 [131]. A study on the Villin headpiece folding free energy landscape suggests the ID for this system is around 12 [157]. However, there is no guarantee that the ID is always around 10 for more complex systems. A recent study on the SARS-CoV-2 main protease, which is a homodimer with 306 residues in each monomer, suggests that the ID is about 27 for each monomer [158]. Therefore, applying enhanced sampling techniques to more challenging problems like protein complexes requires a large number of CVs even if these CVs are optimal.
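The sketch below follows the spirit of the two-nearest-neighbour ID estimator described above: the ratio α = r2/r1 is collected for every sample, and d is obtained from a linear fit of the empirical cumulative distribution of α; the Gaussian test data and all implementation details are illustrative assumptions rather than the reference implementation of the cited algorithm.

```python
import numpy as np

def two_nn_intrinsic_dimension(x: np.ndarray) -> float:
    """Estimate the intrinsic dimensionality from ratios of 2nd/1st neighbour distances."""
    sq = (x ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    np.fill_diagonal(d2, np.inf)
    part = np.partition(d2, 1, axis=1)          # two smallest squared distances per row
    r1, r2 = np.sqrt(part[:, 0]), np.sqrt(part[:, 1])
    alpha = np.sort(r2 / r1)
    n = alpha.size
    F = np.arange(1, n + 1) / n                 # empirical CDF of alpha
    keep = F < 0.95                             # drop the extreme tail for stability
    # F(alpha) = 1 - alpha^(-d)  =>  -log(1 - F) = d * log(alpha): fit d through the origin
    xlog, ylog = np.log(alpha[keep]), -np.log(1.0 - F[keep])
    return float((xlog @ ylog) / (xlog @ xlog))

# Test on points that live on a 3-dimensional manifold embedded in 10 dimensions.
rng = np.random.default_rng(5)
latent = rng.normal(size=(1500, 3))
embedded = latent @ rng.normal(size=(3, 10))
print("estimated ID:", round(two_nn_intrinsic_dimension(embedded), 2))
```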
Difficulties in exploring a high-dimensional FES may show up in the sampling step or in the FES construction step. For example, d-AFED/TAMD simulations are feasible with many CVs [16, 27]. Calculating a free energy difference between two points in the CV space is also trivial. However, unbiasing samples from a d-AFED/TAMD simulation is extremely hard with a large number of CVs, since it requires full knowledge of the FES [78]. A common implementation of metadynamics has difficulties constructing a biasing potential with ≥4 CVs, as the biasing potential is usually stored and evaluated on grids [159]. Similar difficulties also exist with the ABF method [160]. Progress has been achieved in making biased enhanced sampling methods work with a high-dimensional FES. For example, machine learning-based biasing potentials have been developed with the Gaussian mixture model [70], the artificial neural network model [144], and the kernel density estimation method [161]. Similarly, training a high-dimensional FES from d-AFED/TAMD with various machine learning models has also been tested [142], as introduced in Sect. 5. Either training an accurate high-dimensional FES or learning an accurate high-dimensional probability is difficult. First, free energies and/or FES derivatives are needed prior to FES training in a supervised learning approach. However, obtaining accurate free energies and derivatives is non-trivial with large d. Samples tend to be "sparse" in a high-dimensional space. For example, assuming a quadratic FES A(s) = (1/2)|s|^2 with s ∈ R^d, the Boltzmann distribution is simply a standard normal distribution. The distance between two independent samples s_1 and s_2 increases with d on average, i.e. E(|s_1 − s_2|^2) = 2d. The sparsity suggests that a large bin size or a large kernel bandwidth is required. However, a larger bin size or a larger kernel bandwidth leads to larger systematic errors in calculating free energies and FES derivatives. Second, samples on a high-dimensional FES are typically distributed with a "spider web" structure [73]. Clusters of samples are located at local minima on the FES, together with free energy paths connecting different minima. However, biased enhanced sampling methods, including metadynamics, d-AFED/TAMD, and ABF, attempt to uniformly or near-uniformly sample the CV-space, which is very challenging on a high-dimensional FES as the volume of the CV-space increases exponentially with the dimensionality d. Therefore, efficiently sampling on a high-dimensional FES requires focusing on exploring minima and transition paths on the FES instead of uniformly converging the FES [69]. For example, from our experience, the d-AFED/TAMD method usually discourages using an extremely high CV temperature with a large number of CVs, in order to avoid spending too much time exploring high-free-energy regions. The trade-off between enhancing barrier crossing and avoiding exploring high-free-energy regions requires fine tuning of the sampling methods. Finally, the "spider web" type of configuration distribution suggests that data for training CVs or biasing potentials only cover a small percentage of the CV-space.
An enhanced sampling simulation may lead the system to explore regions of the FES that are not supported by data. Applying a machine learning-based biasing potential or machine learning-based CVs in these regions means extrapolating the model, which can perform poorly. Therefore, one difficulty of exploring a high-dimensional FES with machine learning methods is dealing appropriately with this extrapolation problem. For example, a biasing potential of the form kT log(ρ(s) + ε) with a shifting constant ε can be used to prevent errors in the trained ρ(s) from damaging MD simulations, especially where ρ(s) is small [71, 133]. In another study, ensemble learning has been applied to avoid applying a biasing potential to unexplored conformations [141]: the biasing potential is switched off if the variance of the trained biasing potential is too large. The same extrapolation problem has been discussed in applying StKE CVs [71], where a physically intuited CV was combined with the StKE CVs during the enhanced sampling simulation to maintain high sampling efficiency when new conformations are visited. However, systematic studies are still needed to improve the extrapolation capability, or generalizability, of machine learning-based enhanced sampling approaches on a high-dimensional FES.

Besides developing enhanced sampling methods to overcome the challenges of exploring a high-dimensional FES, developing theories and methods that unify enhanced sampling and coarse-grained (CG) model building is another interesting direction. Training a high-dimensional FES is closely related to constructing a CG model with bottom-up approaches [162–165]. In this scenario, the CG degrees of freedom become CVs while the interaction potential of the CG model is a FES. Since a CG model requires transferability, CG degrees of freedom are usually centers of atom groups. Some CG potential fitting methods, such as force matching [162–164], are very similar to training a FES via mean forces (a minimal sketch of force matching is given below). Recently, machine learning-based CG potentials have been developed with significant improvements in accuracy [166–169]. However, only limited studies have attempted to unify enhanced sampling and CG model building [169].

There are two possibilities for integrating enhanced sampling and CG models. (1) Use enhanced sampling to build a transferable CG model. In order to transfer the CG model to other systems, a CG degree of freedom should be a local CV that is a function of a group of neighboring atoms. Moreover, a few-body interaction potential is needed for the CG potential energy function. In a recent study, machine learning-based CG potential energy functions with few-body (two-body to five-body) interactions have been proposed [170]. Samples from the CG potential and samples from all-atom simulations were projected onto TICA CVs to generate a CG FES and an all-atom FES; with the five-body potential, the CG FES agreed with the all-atom FES reasonably well. This study suggests that it is possible to train a CG model with few-body interactions. (2) Use all-atom MD simulations to train a system-specific CG model and apply this model to enhance all-atom MD simulations of the same system. In this case, transferability may not be required; for example, applying the CG potential energy function as a biasing potential [169] allows the CG potential energy function to contain full-body interactions. However, a transferable CG model is still needed if the CG model is meant to predict conformations that have not been sampled.
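To make the connection between force matching and mean-force/FES training concrete, the following is a minimal, self-contained sketch (not the method of any specific reference above): a CG pair potential expanded in Gaussian radial basis functions is fit by least squares to reference forces on the CG sites. All function names, parameters, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def basis_deriv(r, centers, width):
    """Derivatives d(phi_k)/dr of Gaussian radial basis functions phi_k(r)."""
    phi = np.exp(-((r - centers) ** 2) / (2.0 * width ** 2))
    return -(r - centers) / width ** 2 * phi

def force_design_matrix(R, centers, width):
    """For a CG pair potential u(r) = sum_k theta_k phi_k(r), build G such that
    the flattened CG forces of one configuration R (n_sites, 3) are F = G @ theta."""
    n = R.shape[0]
    G = np.zeros((3 * n, len(centers)))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rij = R[i] - R[j]
            r = np.linalg.norm(rij)
            # Force on site i from pair (i, j) is -u'(r) * rij / r.
            G[3 * i:3 * i + 3, :] -= np.outer(rij / r, basis_deriv(r, centers, width))
    return G

# --- toy force matching on synthetic data (illustrative only) ---
rng = np.random.default_rng(0)
centers = np.linspace(0.3, 1.2, 8)    # basis centers, arbitrary units
width = 0.15

frames = [rng.uniform(0.0, 1.0, size=(6, 3)) for _ in range(200)]  # fake CG configurations
forces = [rng.normal(size=(6, 3)) for _ in range(200)]             # stand-in for mapped all-atom forces

G = np.vstack([force_design_matrix(R, centers, width) for R in frames])
F = np.concatenate([f.ravel() for f in forces])

theta, *_ = np.linalg.lstsq(G, F, rcond=None)   # least-squares force-matching solution
print("fitted pair-potential coefficients:", theta)
```

In practice the reference forces would be all-atom forces mapped onto the CG sites, and a richer functional form (many-body terms or a neural network) would replace the simple pair basis; the least-squares structure is the same reason force matching resembles training a FES from mean forces.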
The idea of CV-based enhanced sampling is analogous to reinforcement learning [141, 169]: a high CV temperature or a biasing potential acts like a policy that guides the exploration of configuration space, while MD serves as an "explorer" that discovers new conformations or recurrently revisits conformations to improve statistics. However, the efficiency of an MD simulation is limited. A typical time step of an MD simulation of molecules is about 0.5–2 fs, and a nearly independent configuration can be sampled only every ∼1 ps, which requires ∼1000 force evaluations. These massive force evaluations, especially the non-bonded force calculations, are the most time-consuming part of an MD simulation and therefore significantly reduce sampling efficiency. One way to enhance the efficiency of MD simulations is to increase the time step; for example, the time step for evaluating non-bonded interactions can be increased up to ∼100 fs in a simulation with the isokinetic ensemble [171, 172]. However, further improvements in all-atom sampling efficiency are still needed. Although there are only a few studies on sampling all-atom configurations with machine learning methods, these developments are changing the world of all-atom sampling [95, 111, 173–176]. Here we briefly discuss some of these methods as interesting directions.

We start the discussion with the autoencoder method introduced in Sect. 4. While the encoder part is used to construct CVs, the decoder part is able to generate atomic structures for given CV values. We want to emphasize that fixed values of the CVs correspond to a constrained ensemble of x; therefore, the generated structures should be randomly sampled from this ensemble. In one work, a trained variational autoencoder model can generate random structures of alanine dipeptide [95], although nonphysical features such as collapsed atoms can be found in the generated structures. In another work, the decoder was trained by matching the decoder output to the corresponding encoder input [111]; in this case a structure from the decoder is more like an "interpolation" of simulated structures with similar CV values. Nevertheless, the generated structures are valuable as initial structures for further refinement.

Besides generating atomic structures from CVs, sampling atomic structures directly is much more challenging. A groundbreaking method, named the "Boltzmann generator", has been proposed based on normalizing flows [173]. A normalizing flow [134] learns a change-of-variables function f(·) represented by a neural network, i.e., x = f(z) with $x \in \mathbb{R}^{3N}$ and $z \in \mathbb{R}^{3N}$. f(·) must be a bijection with Jacobian matrix ∂x/∂z. If z is distributed according to the probability density $\rho_z(z)$, the induced probability density of x is $\rho_x(x) = \rho_z(z)\,\left|\det \frac{\partial x}{\partial z}\right|^{-1}$, where x = f(z). With careful design, a normalizing flow neural network is invertible and its Jacobian matrix is a product of triangular matrices, so its determinant is trivial to compute. $\rho_z(z)$ is usually a standard multivariate normal distribution. A normalizing flow is a very powerful generative model since it can generate independent samples of x simply by drawing a normally distributed random variable z and transforming z into x (a minimal sketch of one such invertible layer is given below). However, there are many unsolved problems with this model. For example, applying it with explicit solvent is still challenging, and sampling with a normalizing flow still requires reweighting, which suggests that further improvements in accuracy are needed [177, 178].
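As a concrete (toy) illustration of the change-of-variables bookkeeping above, the following sketch implements a single affine coupling layer with untrained, randomly chosen parameters. Real Boltzmann generators stack many trained layers; all names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4                        # toy dimensionality (a real system would use 3N)
half = dim // 2

# Fixed random parameters of one affine coupling layer (normally these are trained).
W_s = rng.normal(scale=0.1, size=(half, half)); b_s = np.zeros(half)
W_t = rng.normal(scale=0.1, size=(half, half)); b_t = np.zeros(half)

def coupling_forward(z):
    """x = f(z) for a single affine coupling layer; also return log|det(dx/dz)|."""
    z1, z2 = z[:half], z[half:]
    s = np.tanh(z1 @ W_s + b_s)          # log-scale, bounded for numerical stability
    t = z1 @ W_t + b_t                   # shift
    x2 = z2 * np.exp(s) + t              # second half is scaled and shifted
    return np.concatenate([z1, x2]), np.sum(s)   # first half passes through unchanged

def coupling_inverse(x):
    """z = f^{-1}(x); the layer is invertible by construction."""
    x1, x2 = x[:half], x[half:]
    s = np.tanh(x1 @ W_s + b_s)
    t = x1 @ W_t + b_t
    return np.concatenate([x1, (x2 - t) * np.exp(-s)])

# Generate one sample: draw z ~ N(0, I), push it through the flow, and use
# log rho_x(x) = log rho_z(z) - log|det(dx/dz)| for later reweighting.
z = rng.standard_normal(dim)
x, logdet = coupling_forward(z)
log_rho_z = -0.5 * np.sum(z**2) - 0.5 * dim * np.log(2 * np.pi)
log_rho_x = log_rho_z - logdet
print(x, log_rho_x)
```

In an actual Boltzmann generator many such layers are composed and trained against the Boltzmann weight of the target system; the reweighting mentioned above uses exactly this log ρ_x(x) bookkeeping to correct the generated samples toward the true distribution.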
There are some general open questions for machine learning-based all-atom samplers. The first is how to efficiently combine these samplers with enhanced sampling algorithms. The answer for the autoencoder is relatively straightforward since the concept of a CV is already built into the model; moreover, the configuration probability conditioned on given CV values is approximately unbiased in many enhanced sampling methods, which makes it easier to train an autoencoder model. The second question concerns the accuracy of extrapolating the sampler to new conformations that are not in the training dataset, which requires further study.

In this perspective we have reviewed recent developments in combining machine learning methods with CV-based enhanced sampling, especially in CV construction and FES training. Although introducing machine learning algorithms into sampling has achieved great successes, many challenges remain, including generating accurate kinetic information from biased enhanced sampling simulations, exploring a high-dimensional FES, and developing machine learning-based all-atom samplers for CV-based enhanced sampling. Integrating machine learning techniques with enhanced sampling often goes beyond applying generic machine learning algorithms directly; developing physical theories and enhanced sampling methods that collaborate with machine learning techniques is also needed. Finally, building machine learning models tuned for molecules and materials is also important for unifying machine learning with CV-based enhanced sampling methods [179].

Author contributions: MC wrote and revised the whole paper.

Data availability statement: This manuscript has no associated data or the data will not be deposited. [Authors' comment: Data is available upon request from the authors.]