key: cord-0273270-yxo8oanc
authors: Punjani, Ali; Fleet, David J.
title: 3D Variability Analysis: Resolving continuous flexibility and discrete heterogeneity from single particle cryo-EM
date: 2020-11-17
journal: bioRxiv
DOI: 10.1101/2020.04.08.032466
sha: 80615c806480bc821be983aa6ad97877bd8588ba
doc_id: 273270
cord_uid: yxo8oanc

Single particle cryo-EM excels in determining static structures of protein molecules, but existing 3D reconstruction methods have been ineffective in modelling flexible proteins. We introduce 3D variability analysis (3DVA), an algorithm that fits a linear subspace model of conformational change to cryo-EM data at high resolution. 3DVA enables the resolution and visualization of detailed molecular motions of both large and small proteins, revealing new biological insight from single particle cryo-EM data. Experimental results demonstrate the ability of 3DVA to resolve multiple flexible motions of α-helices in the sub-50 kDa transmembrane domain of a GPCR complex, bending modes of a sodium ion channel, five types of symmetric and symmetry-breaking flexibility in a proteasome, large motions in a spliceosome complex, and discrete conformational states of a ribosome assembly. 3DVA is implemented in the cryoSPARC software package.

Protein dynamics hold the key to understanding function in many molecular machines. Single particle cryo-EM collects thousands of static 2D protein particle images that, in aggregate, span the target protein's 3D conformational space. Cryo-EM thus holds great promise for studying protein dynamics [22] . However, there are significant computational challenges in resolving dynamics from static image data. These difficulties stem from the need to simultaneously estimate the canonical structure of the molecule, the multiple ways in which the structure can change (due to motion, dissociation, etc), and the state of the molecule in each particle image. These problems are exacerbated by high levels of noise in the imaging process and the need to also estimate, for each particle image, the CTF corruption and 3D coordinate transform between the particle in each image and the 3D model.

A heterogeneous target molecule will exist in different conformational states within a sample. These 3D structures lie on a manifold in the abstract space of all possible 3D structures. Each observed particle image X provides a noisy, CTF corrupted, 2D projection of the 3D structure V at a point on the manifold. The manifold may have a non-linear geometry in general, but for many molecules of interest a linear subspace model provides an effective way to represent the 3D structures in a cryo-EM sample. A K-dimensional linear subspace represents 3D structures present in the sample in terms of a base structure V 0 and a weighted sum of K components, V k :

The weight vector, z = (z 1 , ..., z K ), or latent coordinates, represent the conformational state of the molecule in a given particle image. Subspace models, such as Principal Component Analysis (PCA), have been discussed extensively in the cryo-EM literature [1, 30, 45, 47] . They are attractive for their simplicity, analytical properties, and the existence of stable numerical methods for eigenvector computation. Nevertheless they are not widely used in practice (e.g., compared to K-way discrete 3D classification [12, 39, 41, 42] ), as existing methods cannot estimate the components in (1) at high resolutions, nor estimate several components simultaneously. As a consequence they fail to resolve the detailed conformational changes that are most important at high resolutions.

We introduce 3D Variability Analysis (3DVA) for fitting 3D linear subspace models to single particle cryo-EM data. Following Tagare et al [45] , the algorithm is formulated as a variant of the Expectation-Maximization algorithm for Probabilistic PCA. By decoupling specific tasks in the numerical optimization, we obtain a method that enables efficient computation of accurate, high-resolution representations. Through experimental results we show that 3DVA allows for directly resolving molecular motion, flexibility, changes in occupancy, and even discrete heterogeneity of a wide range of protein molecules. It resolves high-resolution flexibility, with motions of individual α-helices as small as a few angstroms, for both large and small proteins. To our knowledge, this is the first method to resolve flexible dynamics of small proteins, such as the distinct types of flexible motion we find in the sub-50 kDa transmembrane region of a GPCR complex, at resolutions below 4Å.

By resolving continuous conformational changes, 3DVA reveals biological insights from single particle cryo-EM data beyond the reach of existing methods, simplifying the analysis of conformational heterogeneity. We show that it can resolve ratcheting motions of a ribosome, large flexible motion of a pre-catalytic spliceosome complex, and several discrete conformational states of a bacterial large ribosomal subunit assembly. For such large, multi-MDa complexes, recently proposed deep generative models [52, 51] have been experimentally demonstrated to resolve similar continuous and discrete heterogeneity at coarse resolution. However, 3DVA is 20-50x faster to optimize, requiring no manual parameter tuning or hyperparameter optimization on individual datasets.

3DVA was released in the cryoSPARC [33] software package v2.9+ and has been used in several structural studies, including the SARS-CoV-2 virus spike protein [49] , mTORC1 docked on the lysosome [37] , human oligosaccharyltransferase complexes OST-A and OST-B [35] , bacterial unfoldase-protease complex ClpXP [36] , a viral chaperonin [44] , the Acinetobacter baumannii 70S Ribosome [25] , and Adrenomedullin Receptors AM1 and AM2 [21] .

Under the standard cryo-EM image formation model [8, 13, 32, 40] , a 2D particle image, X i , is a corrupted projection of the target 3D density V from an unobserved pose φ i , plus additive noise, η:

where P (φ i ) is the projection operator and C i is the contrast transfer function (CTF) for image i. The standard model assumes all images are generated from a single 3D density V. It assumes a homogeneous population of molecules, all in the same conformation, differing only by a rigid coordinate transformation. K-way 3D classification methods [12, 39, 41, 42] extend this mode, assuming each particle image is generated from one of K different, independent, 3D densities. This assumes the sample comprises a small number of distinct structures.

Many protein molecules have a mechanistic function, exhibiting flexible motion across a continuous landscape of energetically favourable conformations. This flexibility can result in a continuum of conformational states present in a frozen sample. In such cases, algorithms based on assumptions of the homogeneous or K-way classification models yield low quality reconstructions that do not shed light on the biological function of molecular motions.

3D variability analysis accounts for continuous conformational flexibility in terms of a K-dimensional linear subspace model. Under a variant of the standard probabilistic PCA model, each experimental particle image is generated from a mean density, V 0 , plus weighted contributions of several variability components, V 1:K . The model also includes a per-particle scale factor, α i , to account for varying ice-thickness and contrast level, and additive Gaussian noise, η. In particular, η is assumed to have zero mean and a diagonal covariance matrix in the Fourier domain, with constant variance in annular rings. White noise is a special case. Formally,

Each V k is a 3D density, capturing a particular type of change to V 0 . The vector of per-particle weights, z i = (z i1 , ..., z iK ) are also known as the latent coordinates.

The predominant method for learning linear subspace models in many domains is Principal Component Analysis (PCA) [2] , under the assumption that the basis functions are orthogonal; i.e., V T i V j = 1 if i = j, and 0 otherwise. Given a random data sample (e.g., 3D densities), assumed to be generated independently from the same underlying distribution, for PCA one would computes the sample mean (V 0 in (3)), and the leading eigenvectors of the covariance matrix of the data distribution (i.e., those with the largest eigenvalues). These principal directions account for the most significant factors of variation in the data (like V 1:K in (3).

Direct application of PCA to single particle cryo-EM data is challenging however. First, each 2D image is a partial observation of the 3D density from one direction, so construction of the full covariance of 3D densities from images is non-trivial. Second, the high dimensionality of the covariance matrix means that it may not be possible to store in memory, or to compute its eigenvectors except at low resolutions. Third, each image is corrupted by a different CTF, violating the PCA assumption that data are generated from the same distribution.

To mitigate these challenges one could restrict analysis to 2D by computing the 2D covariance of particles within a single 2D class or viewing direction. This removes the partial observation problem, but fails to capture 3D variability or account for the CTF [47] . In 3D one can use bootstrapping [30] , wherein multiple random subsets of images are used to reconstruct multiple 3D structures under the homogeneous model (Eq. 2). With such "bootstrapped" 3D densities one avoids the missing information problem due to projection, but it can only operate at low resolutions to avoid high-dimensional covariance matrices. It further suffers from statistical inefficiency due to the random noise injected into each bootstrap sample. Together, these drawbacks limit such methods to resolving coarse resolution eigenvectors that capture gross structural changes of large protein complexes [30] . One can also employ low-dimensional approximations to the covariance, which mitigate storage and computational cost, but such methods have been demonstrated only at coarse resolutions [1] .

As explained in the Methods section, 3D Variability Analysis overcomes the challenges with standard PCA techniques, enabling computation of variability components at high resolution and for smaller proteins. 3DVA is a form of Probabilistic PCA [38, 46] , which assumes data are drawn from a high dimensional Gaussian distribution. The model assumes Gaussian observation noise in (3), and a Gaussian prior over latent coordinates. The Expectation-Maximization algorithm [7] is used to obtain a Maximum Likelihood (ML) estimate of the variability components V 1:K , along with ML estimates for the noise covariance and the latent coordinates z. Crucially, this PPCA algorithm accommodates partial observations and data-specific corruption (the CTF in cryo-EM data), and it works well with high-dimensional data without the need to explicitly store or approximate the 3D covariance matrix [38] . It also supports direct estimation of only the top K components, as desired in 3DVA. For a set of M images, the log likelihood objective is, up to an additive constant, a weighted sum of squared residual errors between images, X i , and model predictions:

where Λ is the inverse covariance (a.k.a. precision matrix) of the observation noise η, and v 2 Λ ≡ v T Λv. The 3DVA algorithm assumes the mean density, V 0 , along with the per-particle CTFs, C i , and poses, φ i , are known. These quantities could be estimated by processing the image data using standard homogeneous refinement.

Given these quantities, and random initial values for the variability components, 3DVA uses iterative optimization, each iteration of which has an E-step and an M-step. As explained in Methods, based on a free-energy derivation of the algorithm, the E-step updates the mean of the posterior distribution over the latent coordinates, z i , independently for each image X i . The M-step then uses the expected log liklihood to update the per-particle scale parameter α i , and the variability components, i.e., V 1:K .

Tagare et al. [45] proposed a similar subspace model, using Maximum Likelihood estimation with a form of Expectation-Maximization. However, their M-step is solved via an approximate iterative conjugate gradient, and each V k is solved sequentially, requiring K complete rounds of optimization, each passing through the data repeatedly until convergence for each V k . Furthermore, the CTF is treated approximately using Wiener filtering, image noise is assumed to be white, and particles are binned into a finite number of 3D pose bins. Thus, they were only able to compute linear subspace models for coarse low-resolution motion and variations; results on experimental data are demonstrated only at 15Å resolution, and only for K = 2 variability components.

3DVA computes the M-step exactly via matrix operations, without approximate methods, and all K variability components are solved simultaneously at full resolution. In addition, per-particle scale factors α i and a colored noise model are optimized in parallel. During optimization, the variability components V 1:K are regularized by a low-pass filter and high-pass filter. The low-pass filter attenuates high-frequency noise to prevent over-fitting.

An optional high-pass filter removes sources of low-resolution variability caused by contaminants (e.g. denatured protein mass at air-water interfaces).

A GPU implementation of 3DVA in cryoSPARC allows one to process large experimental datasets with 10 6 particle images. It resolves multiple variability components simultaneously at resolutions as high as 3Å-4Å, limited only by GPU memory and signal present in the data.

When 3DVA is applied to single particle EM data, it outputs variability components, V 1:K , and per-particle latent coordinates, z. Variability components are those directions in the space of 3D density that, from the mean structure, V 0 , define a linear subspace that best fits the principal directions of variation in the data. Each component is a 3D

voxel grid of density values. These value indicate where density should be added or removed to explain variability amongst the particles. Orthogonality of the K components ensures that each explains a different mode of variation. 3DVA also yields latent coordinates, z, for each particle image. They indicate the level to which each variability component is present in a given image. If the ith particle image has a large positive value for z ik , this means that it is best described by adding a large amount of V k to V 0 . The variance of each latent coordinate across the images provides a measure of importance, since components with large variance explain the most variability in the data. When a population of particle images contains well-separated clusters of different conformations, 3DVA components will identify the differences between clusters, and the latent coordinates of particles will be similarly clustered, providing insight into discrete heterogeneity.

3DVA is not sensitive to initialization bias. Generally, iterative algorithms that solve non-convex optimization problems (like ML estimation of the linear subspace model) can give incorrect results if initialized poorly. It has been proven, however, that for PPCA models (like 3DVA) any stable local optima is also a global optimum of the objective function [46] . This means that no matter how V 1:K are initialized at the start of optimizing 3DVA, the results will be equivalent. This is in contrast to existing methods for K-way 3D classification and other non-linear methods, where initialization is critical and multiple runs can often yield significantly different results.

It also notable that 3DVA, using Expectation-Maximization, is a Krylov subspace method [38] . Krylov subspace methods are a general class of iterative optimization algorithms for computing the eigenvectors of matrices, used in popular linear algebra techniques like SVD [10] . Krylov subspace methods like 3DVA have the beneficial property of solving eigenvectors in order of importance. Running 3DVA with K = 3 will yield the top three variability components that explain the most variability in the data, and running with K = 6 will yield the same three plus the next three important components.

With these properties and the stability of Expectation-Maximization, 3DVA does not require tuning or parameter changes between datasets, other than resolution limits and the number of components, K. Since 3DVA depends on the quality of the mean density, V 0 and the particle alignments, methods that produce high quality consensus structures and alignments in the presence of flexibility and heterogeneity, e.g., non-uniform refinement [34] , can improve the results of 3DVA.

In what follows, we consider one synthetic dataset and six real-world cryo-EM datasets. Results on the synthetic dataset help to validate the approach and convergence of 3DVA to the true underlying principle linear subspace of the data. Results on the real-world datasets demonstrate the ability of the method to resolve high resolution continuous and discrete conformational changes from single particle cryo-EM.

For the real-world experimental data we first compute a consensus refinement using all particle images. In the case of membrane proteins, non-uniform refinement [34] is used to improve image alignments and overall resolution. The resulting particle images and pose alignments are then used to compute variability components and latent For data with continuous heterogeneity, we render the 3D density generated by the subspace model at positive (blue) and negative (red) values of the latent coordinate for individual variability components, along with a superimposed rendering of both densities to clearly indicate the difference between them. Guide lines (solid and dashed) and feature markers (+) provide visual reference points. For each dataset, videos displaying flexible conformational changes are available as supplementary movies.

We generate synthetic data by deforming the 3D density map of a T20S proteasome [3] at 3Å. The ground truth deformations lie in a three dimensional linear subspace, the basis functions for which correspond to bending along each of the x, y, and z axes. Figure 2 shows 3D renderings of the consensus refinement plus/minus (red/blue) one standard deviation along each of the ground-truth subspace directions (top row), along with slices in the x − y plane through each of the ground-truth subspace directions (bottom row).

Each synthetic dataset comprises 100, 000 particle images. For each particle image, we randomly sample ground-truth latent coordinates from a mean-zero Gaussian distribution with a diagonal covariance matrix, the diagonal variances of which are 50 2 , 40 2 and 30 2 , corresponding to bending along the x, y, and z axes. The corresponding deformation is then applied to the T20S density map. A viewing direction is then randomly drawn from a uniform distribution over the sphere. For CTF parameters, average defocus is drawn at random uniformly from the interval [0.5µm, 2.0µm], astigmatism from the interval [−0.1µm, 0.1µm], and astigmatism angle uniformly around the circle. Accelerating voltage is set to 300kV, spherical aberration to 2.7mm, and per-particle scales to 1.0. Particle images are generated using Eq. 3.

Particle images are collectively normalized to have unit signal variance on average. White noise is them generated and added to the particle images. To this end we use four noise levels, namely, σ = 0, 4, 10, and 20, corresponding to noiseless data, and SNRs of 1/16, 1/100, and 1/400 respectively. This yields four datasets, each with 100,000 particle images, that are used to test 3DVA (see Figure 3a) .

At each noise level, we run 3DVA using the ground-truth poses and CTF parameters for each particle, estimating 3 variability components. A consensus reconstruction of input particles is used as the mean density V 0 .

Low-pass filtering is not used, and initialization is random. In all cases, 3DVA correctly recovers the true subspace, as well as the latent coordinates of particles. Figure 3b shows slices through the x − y plane for estimated variability component 1 at each noise level. These figures can be compared against Figure 2 (bottom left). In all cases, 3DVA resolves the ground-truth subspace direction. Noise is present in the estimated variability component due to the inherent noise in the input images. Figure 3c shows scatter plots of the true latent coordinates for component 1 against the estimated coordinates for component 1. When the noise level is low or zero (left), latent coordinates are recovered with nearly perfect correlation to the true coordinates. As noise increase the estimates become more varied, but correlation with the true coordinates remains high. Figure 3d shows values of the correlation coefficient between ground-truth subspace directions (x-axis) and estimated variability components (y-axis) at each noise level. These results show that in all cases, 3DVA recovers the three subspace directions in the correct order, and without significant confusion between the directions. In the noiseless case, correlations are nearly perfect. In the three noisy cases, correlation decreases as the variability components are affected by noise.

Despite the noise in input images, 3DVA estimated variability components do recover the true signal in the subspace directions at high resolutions. To see that this is the case, we first compute FSC curves between the consensus reconstruction using particles at each noise level and the ground-truth concensus reconstruction ( Figure   4a ). This baseline indicates the degree to which noise limits the algorithm's ability to recover signal from the finite number of images. As expected, the noiseless reconstruction has nearly perfect correlation at all frequencies against the ground-truth. As the noise level is increased, resolvable signal decreases and resolution is limited.

Next, we compute FSC curves between the 3DVA estimated variability components and the ground truth subspace directions. Figure 4b -d show the FSC curves for components 1, 2, and 3 respectively. Again, we see that in the noiseless case, 3DVA recovers the true subspace directions with nearly perfect correlation at all frequencies.

As noise increases, components are resolved with decreasing resolution. Notably, the decrease in resolution of variability components mirrors the decrease in consensus refinement resolution, indicating that the inherent noise in particle images is the main limiting factor for 3DVA estimates, rather than the 3DVA algorithm. In particular, in the noisiest case of σ = 20, the consensus reconstruction is limited to an FSC=0.5 value of 4.0Å while all three subspace components reach nearly the same resolution, 4.1Å. data, noisy data with σ = 4, σ = 10, and σ = 20. 3DVA is run with particles at each noise level, solving three variability components. a: Example synthetic images in the dataset. Raw particle images are shown above, and filtered images (with low-pass filter of 30Å) below. In the noiseless case, ringing from the CTF is visible in the raw particle images. As the noise level is increased, the particle becomes less distinguishable even after filtering. b: Results of 3DVA used to recover of substantial noise, on the synthetic T20S datasets. a: FSC curves between the ground-truth consensus density map and consensus reconstruction from particle images, at each noise level. These curves serve as a baseline, indicating the level to which reconstruction from the finite, noisy particle set can yield resolvable signal. As expected, noiseless data reconstructs nearly perfectly at all resolutions. As noise is added, resolvable resolution decreases. b,c,d: FSC curves between the groundtruth linear subspace direction and the 3DVA estimated variability component, for components 1, 2, and 3 respectively. In the noiseless case, 3DVA recovers the true subspace directions nearly perfectly. As noise as added, resolution of variability components decreases, mirroring the consensus resolution. Notably, variability components can be resolved by 3DVA at nearly the same resolutions as the consensus reconstruction, indicating that noise is the primary limiting factor rather than the 3DVA algorithm.

GPCRs are small membrane proteins responsible for cell signalling and transmission [15] . 3DVA is applied to a dataset of Cannabinoid Receptor 1-G GPCR complex particles [19] , containing the CB1 GPCR, G protein, and scFv. Raw microscope images (EMPIAR-10288) are processed in cryoSPARC v2 to obtain a consensus refinement of the entire structure (Fig. 5a ) from 250,649 particle images. When 3DVA is first run on the entire 3D map, the variability components are dominated by changes in the shape and position of the micelle (Fig. 5b and 5c) . To inspect the heterogeneity of the protein itself, a mask is used to exclude solvent and detergent micelle (Fig. 5d ).

3DVA is then run using three variability components and a low-pass filter resolution of 3Å.

For the Cannabinoid receptor ( Fig. 5e- 3DVA is, to our knowledge, the first method capable of resolving high resolution continuous flexibility for a protein as small as a GPCR complex. An early version of 3DVA in cryoSPARC was also used successfully to understand the differences in conformational dynamics between adrenomedullin 1 (AM1) and adrenomedullin 2 (AM2) receptors [21] .

Applied to 105,247 Pf 80S ribosome particles (EMPIAR-10028 [48] ), 3DVA with four variability components resolves four types of heterogeneity (Fig. 6 ). The first component identifies the presence or absence of a portion of the "head" region of the 40S small subunit. The second resolves rotational motion of the entire 40S subunit along an axis that connects the 40S and 60S subunits. The third component resolves lateral motion of the 40S subunit.

The fourth resolves transverse rotational motion of the "head" region of the small subunit, perpendicular to the rotation in the second component.

In this case, 3DVA was run with low-pass filter resolution set to 8Å. The four orthogonal types of heterogeneity in the molecule were solved in a single run of 3DVA, using the full resolution particles images, with a 360 pixel box size, taking 71 minutes on a single NVIDIA V100 GPU. As a comparison, on this dataset the recently proposed cryoDRGN deep generative model [52] was reported to resolve conformational changes similar to variability components 1, 2, and 4 [51] , but required model training for 150 epochs taking over 60 hours (i.e., 50× more than 3DVA) on the same hardware.

The Na v 1.7 channel [50] is a voltage-gated sodium channel found in the human nervous system. Na v channels are fundamental to the generation and conduction of action potentials in neurons. They are mutated in various diseases, and are targeted by toxins and therapeutic drugs (e.g., for pain relief). A dataset of 431,741 particle images of a Na v -Fab complex is created by complete processing in cryoSPARC v2 from raw data (EMPIAR-10261). This particle set is processed with standard 3D classification methods to separate the discrete conformational states of active and inactive channels. The active class contains 300,759 particles. The overall protein-Fab complex is C2 symmetric, and so the particle images are duplicated with their 3D poses rotated 180 o around the symmetric axis (i.e. symmetry expansion). The resulting 601,518 particles are then input to 3DVA, with six components and a low-pass filter resolution of 3Å. 

The T20S proteasome is a D7-symmetric large, stable protein that is commonly used as a test specimen for cryo-EM microscopes [3] . A dataset of 84,605 particle images of T20S is created by complete processing in cryoSPARC v2 from raw data (EMPIAR-10025). The particle images are duplicated around the 14-fold D7 symmetry (i.e. sym- Despite the generally stable nature of the T20S protein, 3DVA detects five types of continuous bending flexibility in the molecule (Fig. 7d-f ). This is an interesting case due to the high symmetry; three of the variability components (Fig. 7d, left) the symmetric axis. The last two components are asymmetric (Fig. 7d, right) , resolving bending of the entire barrel in two different directions (detail of component 4 in Fig. 7f ). Importantly, this shows that 3DVA can resolve several orthogonal modes of detailed high resolution flexible motion of larger molecules. It can also help detect pseudo-symmetries created by asymmetric flexibility of symmetric molecules.

On 327,490 particle images of a pre-catalytic spliceosome complex (EMPIAR-10180 [31] ), 3DVA was run with two components and a low-pass filter of 8Å, resolving two large motions of multiple parts of the complex (Fig.   8 ). The first component (Fig. 8b) (Fig. 8a, bottom) and 3D

reconstructions are created using those weighted particles. This type of local neighborhood weighting is commonly used in manifold embedding applications, and has been used previously in methods for manifold embedding of cryo-EM data [5] . The resulting 3D density maps depict the molecule at different positions along its continuous flexibility (Fig. 8d,e) .

The pre-catalytic spliceosome data was processed in 3DVA in 176 minutes. By comparison, on this data, the recently proposed cryoDRGN deep generative model [52] was reported to resolve conformational changes similar to the two variability components [51] , but required model training for 30 hours on less than half the number of particles (i.e., 20× slower than 3DVA) on the same hardware.

The bacterial large ribosomal subunit (LSU) is a large macromolecular complex; the cryo-EM dataset contains a mixture of assembly intermediates (EMPIAR-10076 [6] ). 3DVA is applied to 131,899 particles, with four variability components and a low-pass filter of 5Å. The 3DVA components lie in a subspace that spans several discrete 3D states. Within that subspace (Fig. 9a ) the latent coordinates of particle images are well clustered in multiple dimensions. We find that each cluster corresponds to a different partial assembly state of the complete LSU complex.

To analyze and understand the clustering of particles in the 3DVA latent coordinate space, we first fit an 8component Gaussian Mixture Model (GMM) to the latent coordinates (Fig. 9b) . This clusters particles into 8 major groupings. The particles from each cluster are then used to perform a 3D reconstruction; Fig. 9e shows the resulting 3D densities. Cluster 3 contains the least density, and each other cluster adds on a part to the assembly. Cluster 5 displays the entire LSU complex. Clusters 1, 2, 3 and 6 correspond to the four large 3D class subpopulations reported in the original study (C, D, B, E respectively) [6] . Cluster 8 appears to contain mainly outlier particles (contaminants and junk particles). The presence of outliers was also noted in the original study [6] . The identity of this cluster as outliers can be seen in the histogram of estimated per-particle contrast scales α in Fig. 9e , where particles in cluster 8 have a low estimated contrast (indicating a low level of agreement between the particles and the consensus 3D density) compared to all other clusters.

To further investigate and demonstrate the power of 3DVA components to capture discrete heterogeneity, two non-linear embedding methods are applied to the 4-D per-particle latent coordinates. First, a Variational Autoencoder (VAE) [18] is trained to reduce the 4-D latent coordinates to a 1-D embedding where individual clusters can be directly visually identified. The VAE is constructed using PyTorch [29] , with a single hidden layer of size Fig. 9c as a histogram of particle positions. The peaks correspond to the multiple discrete states and the outlier cluster detected by GMM fitting (with the same colors as Fig. 9b ). This result can be directly compared with the 1-D Spatial-VAE embedding carried out in processing of this same dataset using the cryoDRGN generative model [51] , where a similar histogram is generated that resolves four major sub-states, plus an outlier cluster.

A second non-linear embedding technique, UMAP [24] , is applied to the 4-D latent coordinates (Fig. 9d) , displaying the geometry of clusters in the 4-D space, e.g., that clusters 2, 4, and 7 are sub-clusters of a larger cluster. This can also be compared with results in [51] to illustrate the ability of 3DVA to resolve nuanced discrete conformational heterogeneity, despite the apparent simplicity of the linear subspace model.

The 3DVA algorithm enables one to fit high-resolution linear subspace models to single particle cryo-EM data.

The output variability components and associated latent coordinates reveal new biological insights from the data, including detailed flexible motion dynamics at the level of individual α-helices, even for small membrane proteins.

It can can also be used to identify discrete conformational states of proteins from heterogeneous samples.

Linear subspace models, such as PCA, Multivariate Statistical Analysis (MSA), and eigenanalysis have been explored for single particle cryo-EM. Early methods used bootstrapping 3D reconstructions from random subsets of images to coarsely simulate sampling from the conformational distribution of a target molecule [30] . Others attempt to construct a low-resolution complete covariance matrix of the 3D density map distribution underlying a sample, extracting a linear subspace using eigenanalysis [1] . A direct method has also been proposed to solve K linear subspace components via optimization [45] . Such methods are most closely related to 3DVA, as discussed in Results.

Non-linear manifold embedding techniques are also in development. Diffusion maps have been used in 2D, with multiple 2D manifold embeddings combined to resolve a 3D manifold of continuous conformational change [5, 9] . These methods have been demonstrated on experimental data of large proteins, and recent work aims to improve the difficult process of combining 2D manifolds to form a 3D model [23] . Non-linear methods employing the graph Laplacian of similar 2D particle images have also been developed [26] and applied to synthetic data. Most recently, non-linear, deep generative models, have been proposed and shown to resolve coarse-scale conformational changes from experimental cryo-EM data of large complexes [51, 52] . Methods that directly model protein motion have also been proposed, but only demonstrated on simplified synthetic data [11, 20] .

The predominant method currently used to resolve heterogeneity in cryo-EM data is K-way 3D classification [12, 39, 41, 42] . By comparison, a linear subspace model does not constrain particles to exist in a (small) finite number of conformational states. Rather, the subspace components span a continuous space of possible states. It is well-known in the cryo-EM field that clustering methods are not effective for resolving continuous flexibility.

Local refinement [12, 41] and multi-body refinement [27] methods allow high resolution refinement of some regions of flexible molecules, by assuming the molecule is composed of a small number of rigid parts. For successful 2D-3D alignment, such methods require that each region has sufficient mass for accurate local alignment, independent of the rest of the structure. 3DVA does not make this assumption, as it allows generic continuous flexibility within the subspace.

Techniques like normal-modes analysis [4] make assumptions about the energy landscape and dynamics of a protein molecule to predict possible flexibility. Methods have been proposed to exploit these fixed prior models to recover information from cryo-EM data of flexible molecules [43] . In contrast, 3DVA does not presuppose knowledge of the energy landscape, bending, or flexing of the molecule. Rather, it learns this from the image data.

Finally, linear subspace models have also been useful in cryo-electron tomography (cryo-ET), specifically in sub-tomogram averaging where multiple 3D reconstructions of individual molecules, are obtained. By averaging such reconstructions one can improve resolution. In this case, PCA can be directly applied [14, 16] since tomography provides complete 3D observations for each individual particle.

The 3DVA algorithm uses a linear subspace model of 3D structures that comprise the conformational landscape of a protein molecule. The true manifold can have non-linear and complex geometry, for which non-linear manifold models are likely required. For example, extensions of 3DVA using similar algorithmic techniques may be possible that allow for some forms of non-linearity. The development of more accurate and powerful generic models of conformational manifolds is only the first step in truly solving flexible protein structures. In addition to modelling the manifold, techniques must be developed that aggregate structural information across continuous conformational states along a manifold, to solve higher resolution 3D density maps of flexible molecules.

Following Neal and Hinton [28] , the Expectation-Maximization algorithm can be formulated in terms of the minimization of a free energy, which serves as an upper bound the negative log likelihood of the data under the model.

Given data x, model parameters θ, and latent (or missing) paramters z, from Jensen's inequality, that the free energy F bounds the negative log likelihood:

The bound is tight when q is the posterior over the latent variables, q(z) = p(z|x, θ). In that case,

where H is the differential entropy of the posterior, and Q is the expected complete log likelihood under the posterior:

For exponential families the posterior is specified in terms of sufficient statistics ψ (eg its first and second moments, or mean and covariance). Given the true posterior parameters, the bound is tight and the free energy equals the negative log likelihood; Ie, min ψ F(θ, ψ) = Q + H. It follows that minima of F with respect to ψ and θ are also minima of L with respect to θ. The EM algorithm can thus be expressed as coordinate descent on F. At iteration t the E step optimizes the tightness of the bound by computing ψ t = arg min ψ F(θ t−1 , ψ). The M step then optimizes the bound, i.e., θ t = arg min θ F(θ, ψ t ).

The usual generative model for probabilistic PCA is

where z ∼ N (0, I), and η ∼ N (0, I) is isotropic white noise. The model parameters θ include the matrix W and the noise variance . Under this model, the complete likelihood, over x and z, is

From this one can show that posterior p(z|x), a.k.a., the inference distribution, has the form:

For this PPCA modeel, it is straightforward to derive the E-step and M-steps for estimating θ [38, 46] .

It has been shown that PCA and a corresponding EM algorithm are the limiting case of PPCA as e → 0 [38] .

In this case one can show that the posterior covariance reduces to (W T W )W T , and the posterior distribution, in the limit, is a Dirac delta function:

As a consequence, given W t−1 from the previous EM iteration, the posterior mean in the E step is compute per data point as z t i = arg min z ||x i − (W t−1 z + µ)|| 2 . The M step then updates W t = arg min W i ||x i − (W z t i + µ)|| 2 . Notably, it has been shown that both PCA and PPCA (with finite ) produce the same subspace [38] . The main difference between PCA and PPCA is the estimation of the noise variance, which also entails shrinkage of the latent coordinates. versus O(M N 3 K). Since N K this difference is significant. Finally, we note that, although many subspace models assume W is orthogonal, this assumption is not strictly necessary; the formulation of the posterior mean above does not assume orthogonality.

The generative model for 3DVA is given in (3) . The noise and latent coordinates are assumed to be independent and Gaussian, while α and V are unknown parameters we wish to estimate. In particular, we assume a mean-zero, isotropic Gaussian prior for the latent coordinates z ∼ N (0, I K ), and we assume isotropic mean-zero noise η ∼ N (0, I N 2 ). In most cryo-EM datsets of interest a colored noise model is more appropriate than an isotropic noise model. This is important as it changes the relative influence of model fitting as a function of spectral wavenumber.

In practice in 3DVA we use a colored noise model. However, for convenience in what follows we assume the noise has been whitened. For example, if the noise is mean-zero Gaussian with a diagonal precision matrix Λ, then we simply multiply the image x and the CTF by Λ 1/2 . Formally, the whitened model is

where X i ≡ Λ 1/2 X i , and C i ≡ Λ 1/2 C i , and the noise is now isotropic η ∼ N (0, I).

Finally, for 3DVA we derive the algorithm under the assumption that the noise variance tends to zero. While it may seem counter-intuitive to assume that noise tends to zero for cryo-EM data, this allows us to use the limited case of the PPCA model above, which reduces to PCA. Crucially, as noted, PPCA with finite and PPCA with → 0 will both yield the same linear subspace [38] . This means that the variability components recovered by Accordingly, the E-step and M-step can be viewed as a form of iterative coordinate descent. The E-step minimizes the free energy to obtain the posterior mean for each particle image. The M-step then decreases the energy using coordinate descent with respect to the variability component V 1:K and the per particle scale factors.

Given M particle images, the energy function can be rewritten as follows:

where A i ≡ C i P (φ i ). The main challenges with the optimization stems from the large number of unknowns.

In the E-step, we estimate the posterior means, with one optimization per particle image, given the variability components V 1:K and the scale factors α i . To this end, let W ik = A i V k be the scaled projection of kth variability component according to the CTF and pose of the ith particle image. Then, the energy for the ith image is

where W i ≡ [W i1 , ..., W iK ] and z i ≡ (z i1 , ..., z iK ) T . The columns of W i are the projections of the K variability components according to the CTF and pose of the ith particle. The dimension of each column is N 2 , ii.e., the size of the image X i . So W is a N 2 ×K matrix, and W H i W i is a K ×K symmetric positive-definite matrix, where W H denotes the conjugate transpose of W. The posterior mean is given in closed form by

The first phase of the M-step minimizes the energy associated with each particle image in (14) to solve for the updated scale factors (given the most recent latent coordinates). This is given by

.

The second computation in the M-step involves the estimation of the variability component updates. To this end it is useful to rewrite the energy, now summed over all images, in a somewhat simpler form:

where D i ≡ X i − α i W i0 , and z ik is the kth latent coordinate for the ith particle image. To solve for variability components, we require the gradient of E with respect to each V j :

Setting the gradient to zero and rearranging terms yields a linear system of equations,

where H jk = i z ij z ik A H i A i and h j = i z ij A i H D i . Importantly, H jk can be represented as a diagonal matrix.

The CTF operator C i is diagonal, and depending on the form of the interpolation used, the product of P T i P i is either diagonal or approximately diagonal. As is standard in single-particle EM reconstruction algorithms, we approximate this product as a diagonal matrix.

Differentiating E with respect to all variability components, we obtain a block matrix system of equations: 

Exploiting the diagonal structure of individual blocks, H jk , we rearrange the rows and column of (19) to form a block diagonal system having N 3 decoupled, K×K linear systems, one for each wavenumber. Each K×K systems is used to estimate the Fourier coefficients at a specific wavenumber for each of the K variability components.

To rearrange the LHS of (19), the first K columns will contain the first column from each of the K blocks in the large block H matrix in (19) . The next K columns comprise the second column of each of the K blocks and so on. The rows of the large variability component vector are rearranged correspondingly. This places the Fourier coefficients associated with the first wavenumber together, followed by those associated with the second wavenumber next, and so on. Next we rearrange the rows of the matrix and the corresponding RHS elements The first K rows comprise the first row from each of the K blocks of the H matrix on the LHS of (19) . The next K rows comprise the second row of each block, and so on. We also rearrange the corresponding elements on the RHS. This rearrangement yields a block diagonal system, decoupling the estimation of different wavenumbers.

Finally should we want to impose orthogonality constraints, we could use Lagrange multipliers or constrained optimization, but with Expectation-Maximization it is sufficient to orthogonalize the variability components at the end of each iteration. It is done with the Gram-Schmidt algorithm [10] , which is efficient as it only requires inner products between variability components, and sclaing each variability component to have unit length, i.e.,

V H j V j = 1. While not technically required by the iterative solution of a linear subspace model, orthogonality improves computational stability by removing degrees of freedom to which the solution is invariant.

Iterations continue until convergence, or for a fixed number of iterations. The experiments here used 20 iterations. Once completed, we form the K×K covariance matrix of latent coordinates S = i z i z T i . The eigenvectors of S specify a rotation of the variability components and the latent coordinates that aligns the variability components with the principal directions of the data and ensures the latent coordinates are independent under the Gaussian model, as with PPCA. The per-particle scale factors α i are unchanged under this rotation.

Code Availability: The cryoSPARC software package is available for non-profit academic use at www.cryosparc.

com.

Structural Variability from Noisy Tomographic Projections

Pattern Recognition and Machine Learning

2.8Å resolution reconstruction of the Thermoplasma acidophilum 20S proteasome using cryo-electron microscopy. eLife, 4:e06380

Normal Mode Analysis: Theory and Applications to Biological and Chemical Systems

Functional Pathways of Biomolecules Retrieved from Single-particle Snapshots. bioRxiv

Modular Assembly of the Bacterial Large Ribosomal Subunit

Maximum likelihood from incomplete data via the em algorithm

Three-Dimensional Electron Microscopy of Macromolecular Assemblies

Continuous changes in structure mapped by manifold embedding of single-particle data

Matrix Computations

3-d understanding of electron microscopy images of nano bio objects by computing generative mechanical models

cisTEM, user-friendly software for single-particle image processing. eLife, 7:e35383

FREEALIGN: High resolution refinement of single particle structures

Clustering and variance maps for cryo-electron tomography using wedge-masked differences

Structure and dynamics of GPCR signaling complexes

emClarity: Software for high-resolution cryo-electron tomography and subtomogram averaging

Auto-encoding variational bayes

Structure of a Signaling Cannabinoid Receptor 1-G Protein Complex

Continuously heterogeneous hyper-objects in cryo-em and 3-d movies of many temporal dimensions

Structure and Dynamics of Adrenomedullin Receptors AM1 and AM2 Reveal Key Mechanisms in the Control of Receptor Phenotype by Receptor Activity-Modifying Proteins

Challenges and opportunities in cryo-EM single-particle analysis

Propagation of Conformational Coordinates Across Angular Space in Mapping the Continuum of States from Cryo-EM Data by Manifold Embedding

Umap: Uniform manifold approximation and projection for dimension reduction

Cryo-electron Microscopy Structure of the Acinetobacter baumannii 70S Ribosome and Implications for New Antibiotic Development. mBio

Cryo-em reconstruction of continuous heterogeneity by laplacian spectral volumes

Characterisation of molecular motions in cryo-em single-particle data by multi-body refinement in relion. eLife, 7:e36861

A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants

Pytorch: An imperative style, high-performance deep learning library

Identifying conformational states of macromolecules by eigen-analysis of resampled cryo-EM images

Structure of a pre-catalytic spliceosome

Building proteins in a day: Efficient 3d molecular structure estimation with electron cryo-microscopy

CryoSPARC: Algorithms for rapid unsupervised cryo-em structure determination

Non-uniform refinement: Adaptive regularization improves single particle cryo-EM reconstruction. bioRxiv

Cryo-electron microscopy structures of human oligosaccharyltransferase complexes OST-A and OST-B

A processive rotary mechanism couples substrate unfolding and proteolysis in the clpxp degradation machinery. eLife

Structural basis for the docking of mTORC1 on the lysosomal surface

EM algorithms for PCA and SPCA

Processing of structurally heterogeneous cryo-em data in RELION

A Bayesian view on cryo-em structure determination

RELION: Implementation of a Bayesian approach to cryo-em structure determination

Disentangling conformational states of macromolecules in 3D-EM through likelihood optimization

Structures of transcription pre-initiation complex with TFIIH and Mediator

Cryo-EM reveals an asymmetry in a novel single-ring viral chaperonin

Directly reconstructing principal components of heterogeneous particles from cryo-em images

Probabilistic principal component analysis

Multivariate Statistical Analysis of Large Datasets: Single Particle Electron Microscopy

Cryo-EM structure of the Plasmodium falciparum 80S ribosome bound to the anti-protozoan drug emetine. eLife

McLellan. Cryo-em structure of the 2019-nCoV spike in the prefusion conformation

Structural basis of Nav1.7 inhibition by a gatingmodifier spider toxin

CryoDRGN: Reconstruction of heterogeneous structures from cryo-electron micrographs using neural networks. bioRxiv

Reconstructing continuously heterogeneous structures from single particle cryo-em with deep generative models

Acknowledgements: We thank the entire team at Structura Biotechnology Inc. that designs, develops, and maintains the cryoSPARC software system in which this project was implemented and tested. Resources used in this research were provided, in part, by the Province of Ontario, the Government of Canada through NSERC and CIFAR, and companies sponsoring the Vector Institute.Author Contributions: A.P. developed the method, created the implementation and performed experiments.A.P. and D.J.F. created the formulation and wrote the paper.

Competing Interests: A.P. is CEO of Stuctura Biotechnology Inc. which builds the cryoSPARC software package, distributed freely for academic non-profit use with software licenses available for commercial use. D.J.F. is an advisor to Stuctura Biotechnology Inc. The novel aspects of the method presented are described in a provisional patent application.